

Arak M. Mathai, Hans J. Haubold
PROBABILITY AND STATISTICS

A COURSE FOR PHYSICISTS AND ENGINEERS


Arak M. Mathai and Hans J. Haubold 
Probability and Statistics 
De Gruyter Textbook 


Also of Interest 


Linear Algebra. A Course for Physicists and Engineers 
Arak M. Mathai, Hans J. Haubold, 2017 

ISBN 978-3-11-056235-4, e-ISBN (PDF) 978-3-11-056250-7, 
e-ISBN (EPUB) 978-3-11-056259-0 


Probability Theory and Statistical Applications. 

A Profound Treatise for Self-Study 

Peter Zörnig, 2016

ISBN 978-3-11-036319-7, e-ISBN (PDF) 978-3-11-040271-1, 
e-ISBN (EPUB) 978-3-11-040283-4 




Brownian Motion. An Introduction to Stochastic Processes 
René L. Schilling, Lothar Partzsch, 2014 

ISBN 978-3-11-030729-0, e-ISBN (PDF) 978-3-11-030730-6, 
e-ISBN (EPUB) 978-3-11-037398-1 


Numerical Analysis of Stochastic Processes
Wolf-Jürgen Beyn, Raphael Kruse, 2018


ISBN 978-3-11-044337-0, e-ISBN (PDF) 978-3-11-044338-7, 
e-ISBN (EPUB) 978-3-11-043555-9 


Compressive Sensing. 

Applications to Sensor Systems and Image Processing 
Joachim Ender, 2018 

ISBN 978-3-11-033531-6, e-ISBN (PDF) 978-3-11-033539-2, 
e-ISBN (EPUB) 978-3-11-039027-8 


Arak M. Mathai and Hans J. Haubold


Probability and 
Statistics 


A Course for Physicists and Engineers 


DE GRUYTER 


Mathematics Subject Classification 2010 
60-00, 60-01, 60A05, 60D05, 60E05, 62-00, 62-01, 62P30, 62P35


Authors 

Prof. Dr. Arak M. Mathai
McGill University
Department of Mathematics and Statistics
805 Sherbrooke St. West
Montreal, QC H3A 2K6
Canada
mathai@math.mcgill.ca

Prof. Dr. Hans J. Haubold
United Nations Office for Outer Space Affairs
Vienna International Centre
P.O. Box 500
1400 Vienna
Austria
hans.haubold@gmail.com


ISBN 978-3-11-056253-8 
e-ISBN (PDF) 978-3-11-056254-5 
e-ISBN (EPUB) 978-3-11-056260-6 




This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 
License. For details go to http://creativecommons.org/licenses/by-nc-nd/4.0/. 


Library of Congress Cataloging-in-Publication Data 
A CIP catalog record for this book has been applied for at the Library of Congress. 


Bibliographic information published by the Deutsche Nationalbibliothek 
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; 
detailed bibliographic data are available on the Internet at http://dnb.dnb.de. 


© 2018 Arak M. Mathai, Hans J. Haubold, published by Walter de Gruyter GmbH, Berlin/Boston.
The book is published with open access at www.degruyter.com. 

Typesetting: VTeX, Lithuania 

Printing and binding: CPI books GmbH, Leck 

Cover image: Pasieka, Alfred / Science Photo Library 

Printed on acid-free paper 

Printed in Germany 


www.degruyter.com 


Due to the great interest shown by students, a proposal was made to the Department of
Science and Technology, Government of India, New Delhi, India (DST), to conduct
a sequence of 10-day mathematics camps for undergraduates during holidays, so
that one sequence of four camps could cover the essential basic mathematics.
DST approved this project with full financial support for 30 candidates in each camp.
Principals of colleges in Kerala, India, were asked to select up to 5 motivated students
from each college. Each camp is run at the Centre for Mathematical and Statistical
Sciences (CMSS) from 8:30 am until 6:00 pm, continuously for 10 days, with nearly
40 hours of lectures and 40 hours of problem-solving sessions. In the first camp,
Modules 1, 2 and 3, on basic linear algebra for all disciplines, are covered. In the
second camp, Modules 4 and 5, on sequences, limits, continuity, differential calculus
and basic integral calculus, are covered. In the third camp, Module 6, on sample space,
probability, random variables, expected values, statistical distributions and their
applications, is covered. In the fourth camp, Modules 7 and 9, on sampling distributions,
statistical inference, prediction and model building, design of experiments and some
non-parametric tests, are covered.

The basic sequence of four camps, covering basic mathematics, probability and
statistics, was repeated over the years from 2007 to 2014. There was also a four- to
six-week course every year at the research level, meant for MSc and PhD students
and young faculty, known as the SERC Schools (Science and Engineering Research Council
of the Department of Science and Technology, Government of India, New Delhi). The
notes from these schools were brought out by CMSS in book form every year. Selected
notes from the SERC Schools of 1995 to 2006 were brought out as Special Functions
for Applied Scientists by Springer, New York, in 2008, authored by A. M. Mathai and
Hans J. Haubold. Selected notes from the SERC Schools of 2007 to 2012 appeared in
2017 as a research-level book published by Springer, New York. Selected notes from the
SERC Schools of 2013 to 2015 will appear as a research-level book on Matrix Methods and
Fractional Calculus published by World Scientific Publishing, all by the same authors.

The present book on probability and statistics consists of 16 chapters. Chapters 1
to 9 cover the topics of random experiments, sample space, probability, how to assign
probabilities to individual events, random variables, expected values, statistical
distributions, collections of random variables and the central limit theorem. The
statistics part consists of Chapters 10 to 16, covering the topics of sampling distributions,
point estimation, interval estimation, tests of hypotheses, prediction, regression and
model building problems, design of experiments and analysis of variance, some
non-parametric tests, and questions and answers.

All concepts in probability and statistics are explained properly and illustrated
with real-life examples. The material is developed slowly so that the book can also be
used for self-study. Proper interpretations are given of the concepts of variance,
covariance, correlation, multiple correlation, a statistical hypothesis, the meaning of tests
of a statistical hypothesis, the difference between regression and correlation analysis, the
difference between prediction and model building, etc. An introduction to
matrix-variate distributions such as the matrix-variate Gaussian, matrix-variate gamma
and matrix-variate beta is also given.

Open Access. © 2018 Arak M. Mathai, Hans J. Haubold, published by De Gruyter. This work is licensed
under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
https://doi.org/10.1515/9783110562545-201

[Figure: education curricula of the Regional Centres for Space Science and Technology Education. Panel titles: Satellite meteorology and global climate; Space and atmospheric science; Global Navigation Satellite Systems; Remote sensing and the geographic information system; Satellite communications.]

Since 2004, the material in this book has been made available to the UN-affiliated Regional
Centres for Space Science and Technology Education, located in India, China, Morocco,
Nigeria, Jordan, Brazil and Mexico
(http://www.unoosa.org/oosa/en/ourwork/psa/regional-centres/index.html).

Since 1988, the material has been taken into account in the development of education
curricula in the fields of remote sensing and geographic information systems,
satellite meteorology and global climate, satellite communications, space and
atmospheric science and global navigation satellite systems
(http://www.unoosa.org/oosa/en/ourwork/psa/regional-centres/study_curricula.html).

As such, the material was considered to be a prerequisite for applications, teaching
and research in space science and technology. It was also a prerequisite for the
nine-month post-graduate courses in the five disciplines of space science and technology,
offered by the Regional Centres on an annual basis to participants from all 194
Member States of the United Nations.

Since 1991, whenever suitable at the research level, the material in this book has been
utilized in lectures in a series of annual workshops and follow-up projects of the
so-called Basic Space Science Initiative of the United Nations
(http://www.unoosa.org/oosa/en/ourwork/psa/bssi/index.html).

As such, the material was considered a prerequisite for teaching and research 
in astronomy and physics. Astronomy has a long history of exploiting observational 
data to estimate parameters and quantify uncertainty in physical models. Problems in 
astronomy propelled the development of many statistical techniques, from classical 
least squares estimation to contemporary methods such as diffusion entropy analysis. 
Late twentieth century advances in data collection, such as automation of telescopes 
and use of CCD cameras, resulted in a dramatic increase in data size and complexity, 
producing a surge in use and development of statistical methodology. Astronomers 
use these data sets for a diverse range of science goals, including modeling formation 
of galaxies, finding earth-like planets, estimating the metric expansion of space and 
classifying transients. This book reviews common data types and statistical method- 
ology currently in use in space science, with the goal of making applications more 
accessible to methodological and applied statisticians. Naturally, the courses focused 
on the analysis and modeling of observational data (image data, spectral data, func- 
tional data, time series) emanating from a main sequence star, the Sun (figure below), 
utilizing solar particle, photon and neutrino radiation. Exercises went as far as
analyzing solar neutrino data, reaching the convincing conclusion that the solar
neutrino flux varies over time, but without arriving at a conclusive result on the
physics that may drive such variation.


The Space and Atmospheric Sciences education curricula provide opportunities
to teach basic space science. The development of the education curricula (illustrated
above) started in 1988 at UN Headquarters in New York; the specific GNSS curriculum
emerged only in 1999, after the UNISPACE III Conference, held at and hosted by the
United Nations at Vienna.


CMSS A. M. Mathai and Hans J. Haubold 
15 June 2017 Peechi, Kerala, India 


Preface 


Upon requests from colleges in Kerala, India, the Centre for Mathematical and Statistical
Sciences (CMSS) decided to conduct a series of remedial courses on selected
topics. These topics were suggested by the college teachers themselves so that, by
participating in these courses, they could be better prepared to teach the material in their
classes. A series of such courses was conducted by CMSS from 1985 to 2002 at its
Trivandrum Campus, Kerala, India. The notes written up for these courses, and then
class-tested at the University of Texas at El Paso, USA, formed Modules 1, 2 and 3.
The Modules Series of CMSS is meant for self-study. On selected basic topics in
mathematical sciences, materials are developed with many illustrative examples, starting
from the fundamentals. Examples are taken from real-life situations so that the
students can see the relevance of mathematics in solving real-life problems. The subject
matter is developed very slowly so that there is time for absorbing the materials.
Modules 1, 2 and 3 are on vectors, matrices, determinants and their applications.
Module 4 is on limits, continuity, differentiability and differential calculus. Module 5 is on
integrals and integration. Module 6 is on basic probability and random variables; it
forms Chapters 1 to 9 of the materials on basic probability and statistics. Module 7 is on
statistics. Module 8 is on stochastic processes. Module 9 consists of questions and
answers, mainly the questions asked by the students over past years and their answers.
Modules 1 to 9 are expected to cover the basic materials that students from statistics,
physics, engineering and other applied areas need for a proper study of their own
subjects. The present material is a combined version of Modules 6, 7 and 9.


CMSS A. M. Mathai and Hans J. Haubold 
15 June 2017 Peechi, Kerala, India 




Acknowledgement 


Dr B. D. Acharya (deceased), Dr H. K. N. Trivedi (now retired), Dr P. K. Malhotra and
Dr Ashok K. Singh of the Mathematical Sciences Division of DST took the initiative in
getting the funds released for the undergraduate training programs and SERC Schools
of CMSS. CMSS would like to express its gratitude to these visionaries. Many people
have contributed, directly or indirectly, to bringing the CMSS Modules into their final
printed form. When Dr A. M. Mathai wrote up the notes, he set them in Plain TeX.
Dr Shanoja R. Naik (Shanoja Pai) and Dr T. Princy of CMSS spent a lot of time setting
all the figures in the modules and translating the notes to LaTeX, and they deserve special
thanks. All the PhD scholars at CMSS, Dr Nicy Sebastian, Dr Seema S. Nair,
Dr Dhannya P. Joseph, Dr Dilip Kumar, Dr Naiju M. Thomas, Dr Anitha Thomas, Dr Sona Jose
and Dr Ginu Varghese, put in their time and effort to bring the Modules into their final
format. Ms Sini Devassy did all the Photoshop work to make the Modules ready for
printing. Ms R. Girija looked after the office matters. The authors would like to thank
all of them for their sincere and dedicated efforts.




Contents 


Preface — IX 


Acknowledgement — XI 


List of Tables — XIX 


List of Symbols — XXI 


1 Random phenomena — 1
1.1 Introduction — 1
1.1.1 Waiting time — 1
1.1.2 Random events — 2
1.2 A random experiment — 7
1.3 Venn diagrams — 14
1.4 Probability or chance of occurrence of a random event — 18
1.4.1 Postulates for probability — 18
1.5 How to assign probabilities to individual events? — 22
1.5.1 Buffon’s “clean tile problem” — 29

2 Probability — 33
2.1 Introduction — 33
2.2 Permutations — 33
2.3 Combinations — 37
2.4 Sum Σ and product Π notation — 41
2.4.1 The product notation or pi notation — 45
2.5 Conditional probabilities — 48
2.5.1 The total probability law — 53
2.5.2 Bayes’ rule — 55
2.5.3 Entropy — 57

3 Random variables — 61
3.1 Introduction — 61
3.2 Axioms for probability/density function and distribution functions — 66
3.2.1 Axioms for a distribution function — 72
3.2.2 Mixed cases — 72
3.3 Some commonly used series — 76
3.3.1 Exponential series — 76
3.3.2 Logarithmic series — 76




3.3.3 Binomial series — 77 

3.3.4 Trigonometric series — 78 

3.3.5 A note on logarithms —— 78 

4 Expected values — 83 

4.1 Introduction — 83 

4.2 Expected values — 83 

4.2.1 Measures of central tendency — 88 

4.2.2 Measures of scatter or dispersion —— 91 

4.2.3 Some properties of variance — 91 

4.3 Higher moments — 95 

4.3.1 Moment generating function — 100 

4.3.2 Moments and Mellin transforms —— 102 

4.3.3 Uniqueness and the moment problem — 104 
5 Commonly used discrete distributions —— 107 

5.1 Introduction — 107 

5.2 Bernoulli probability law —— 107 

5.3 Binomial probability law —— 109 

5.4 Geometric probability law —— 111 

5.5 Negative binomial probability law —— 113 

5.6 Poisson probability law —— 117 

5.6.1 Poisson probability law from a process —— 121 
5.7 Discrete hypergeometric probability law —— 123 
5.8 Other commonly used discrete distributions —— 125 
6 Commonly used density functions — 133 

6.1 Introduction — 133 

6.2 Rectangular or uniform density — 133 

6.3 A two-parameter gamma density — 136 
6.3.1 Exponential density — 139 

6.3.2 Chi-square density — 139 

6.4 Generalized gamma density — 141 

6.4.1 Weibull density — 142 

6.5 Beta density — 143 

6.6 Laplace density — 146 

6.7 Gaussian density or normal density — 147 
6.7.1 Moment generating function of the normal density —— 151 
6.8 Transformation of variables —— 153 

6.9 A note on skewness and kurtosis —— 159 

6.10 Mathai's pathway model — 161 


6.10.1 Logistic model — 162 




6.11 Some more commonly used density functions —— 163 

7 Joint distributions — 171 

7.1 Introduction — 171 

7.2 Marginal and conditional probability/density functions — 176 

7.2.1 Geometrical interpretations of marginal and conditional 
distributions — 179 

7.3 Statistical independence of random variables — 182 

7.4 Expected value — 185 

7.4.1 Some properties of expected values — 189 

7.4.2 Joint moment generating function —— 191 

7.4.3 Linear functions of random variables — 195 

7.4.4 Some basic properties of the correlation coefficient —— 199 

7.5 Conditional expectations — 201 

7.5.1 Conditional expectation and prediction problem — 204 

7.5.2 Regression — 206 

7.6 Bayesian procedure —— 207 

7.7 Transformation of variables — 211 

8 Some multivariate distributions — 221 

8.1 Introduction — 221 

8.2 Some multivariate discrete distributions — 221 

8.2.1 Multinomial probability law —— 221 

8.2.2 The multivariate hypergeometric probability law —— 224 

8.3 Some multivariate densities — 227 

8.3.1 Type-1 Dirichlet density — 227 

8.3.2 Type-2 Dirichlet density — 230 

8.4 Multivariate normal or Gaussian density — 233 

8.4.1 Matrix-variate normal or Gaussian density — 242 

8.5 Matrix-variate gamma density — 243 

9 Collection of random variables — 249 

9.1 Introduction — 249 

9.2 Laws of large numbers — 252 

9.3 Central limit theorems — 253 

9.4 Collection of dependent variables —— 257 


10 Sampling distributions — 259 


10.1 Introduction —— 259 
10.2 Sampling distributions — 262 
10.3 Sampling distributions when the population is normal — 272 


10.4 Student-t and F distributions —— 277 




10.5 Linear forms and quadratic forms —— 284 

10.6 Order statistics — 291 

10.6.1 Density of the smallest order statistic $x_{n:1}$ — 293

10.6.2 Density of the largest order statistic $x_{n:n}$ — 293

10.6.3 The density of the r-th order statistic $x_{n:r}$ — 294

10.6.4 Joint density of the r-th and s-th order statistics $x_{n:r}$ and $x_{n:s}$ — 298


11 Estimation — 303 

11.1 Introduction — 303 

11.2 Parametric estimation —— 305 
11.3 Methods of estimation — 320 


11.3.1 The method of moments —— 321 
11.3.2 The method of maximum likelihood — 323 


11.3.3 Method of minimum Pearson's $X^2$ statistic or minimum chi-square method — 329

11.3.4 Minimum dispersion method —— 330 

11.3.5 Some general properties of point estimators —— 332 

11.4 Point estimation in the conditional space — 337 


11.4.1 Bayes’ estimates —— 337 
11.4.2 Estimation in the conditional space: model building — 340 
11.4.3 Some properties of estimators — 342 


11.4.4 Some large sample properties of maximum likelihood 
estimators — 344 
11.5 Density estimation —— 349 


11.5.1 Unique determination of the density/probability function — 349 
11.5.2 Estimation of densities — 350 


12 Interval estimation —— 355 

12.1 Introduction — 355 

12.2 Interval estimation problems —— 355 

12.3 Confidence interval for parameters in an exponential population — 358 
12.4 Confidence interval for the parameters in a uniform density — 360 
12.5 Confidence intervals in discrete distributions — 361 

12.5.1 Confidence interval for the Bernoulli parameter p — 362 

12.6 Confidence intervals for parameters in $N(\mu, \sigma^2)$ — 365

12.6.1 Confidence intervals for $\mu$ — 366
12.6.2 Confidence intervals for $\sigma^2$ in $N(\mu, \sigma^2)$ — 369


12.7 Confidence intervals for linear functions of mean values —— 372 

12.7.1 Confidence intervals for mean values when the variables are 
dependent —— 373 

12.7.2 Confidence intervals for linear functions of mean values when there is statistical independence — 375
12.7.3 Confidence intervals for the ratio of variances — 377

13 Tests of statistical hypotheses — 381
13.1 Introduction — 381
13.2 Testing a parametric statistical hypothesis — 382
13.2.1 The likelihood ratio criterion or the $\lambda$-criterion — 391
13.3 Testing hypotheses on the parameters of a normal population $N(\mu, \sigma^2)$ — 395
13.3.1 Testing hypotheses on $\mu$ in $N(\mu, \sigma^2)$ when $\sigma^2$ is known — 395
13.3.2 Tests of hypotheses on $\mu$ in $N(\mu, \sigma^2)$ when $\sigma^2$ is unknown — 396
13.3.3 Testing hypotheses on $\sigma^2$ in a $N(\mu, \sigma^2)$ — 399
13.4 Testing hypotheses in bivariate normal population — 401
13.5 Testing hypotheses on the parameters of independent normal populations — 404
13.6 Approximations when the populations are normal — 408
13.6.1 Student-t approximation to normal — 408
13.6.2 Approximations based on the central limit theorem — 409
13.7 Testing hypotheses in binomial, Poisson and exponential populations — 411
13.7.1 Hypotheses on p, the Bernoulli parameter — 411
13.7.2 Hypotheses on a Poisson parameter — 413
13.7.3 Hypotheses in an exponential population — 414
13.8 Some hypotheses on multivariate normal — 416
13.8.1 Testing the hypothesis of independence — 418
13.9 Some non-parametric tests — 419
13.9.1 Lack-of-fit or goodness-of-fit tests — 420
13.9.2 Test for no association in a contingency table — 425
13.9.3 Kolmogorov-Smirnov statistic $D_n$ — 428
13.9.4 The sign test — 430
13.9.5 The rank test — 431
13.9.6 The run test — 433

14 Model building and regression — 437
14.1 Introduction — 437
14.2 Non-deterministic models — 437
14.2.1 Random walk model — 438
14.2.2 Branching process model — 439
14.2.3 Birth and death process model — 440
14.2.4 Time series models — 440
14.3 Regression type models — 440
14.3.1 Minimization of distance measures — 441
14.3.2 Minimum mean square prediction — 442




14.3.3 Regression on several variables —— 447 

14.4 Linear regression —— 450 

14.4.1 Correlation between $x_1$ and its best linear predictor — 457

14.5 Multiple correlation coefficient $\rho_{1.(2\ldots k)}$ — 459

14.5.1 Some properties of the multiple correlation coefficient —— 459 

14.6 Regression analysis versus correlation analysis — 466 

14.6.1 Multiple correlation ratio —— 466 

14.6.2 Multiple correlation as a function of the number of regressed 
variables — 468 

14.7 Estimation of the regression function —— 475 

14.7.1 Estimation of linear regression of y on x — 476 

14.7.2 Inference on the parameters of a simple linear model — 481 

14.7.3 Linear regression of y on $x_1, \ldots, x_k$ — 484


14.7.4 General linear model — 489 


15 Design of experiments and analysis of variance — 493 

15.1 Introduction — 493 

15.2 Fully randomized experiments — 494 

15.2.1 One-way classification model as a general linear model — 498 


15.2.2 Analysis of variance table or ANOVA table — 499 
15.2.3 Analysis of individual differences — 501 


15.3 Randomized block design and two-way classifications —— 503 
15.3.1 Two-way classification model without interaction — 505 
15.3.2 Two-way classification model with interaction — 510

15.4 Latin square designs — 516 

15.5 Some other designs — 521 

15.5.1 Factorial designs —— 521 

15.5.2 Incomplete block designs — 522 

15.5.3 Response surface analysis — 522 


15.5.4 Random effect models — 522 


16 Questions and answers — 525 

16.1 Questions and answers on probability and random variables —— 525 
16.2 Questions and answers on model building —— 539 

16.3 Questions and answers on tests of hypotheses —— 549 


Tables of percentage points — 559 
References — 577 


Index — 579 


List of Tables 


Table 1 Binomial coefficients — 559
Table 2 Cumulative binomial probabilities — 560
Table 3 Cumulative Poisson probabilities — 565
Table 4 Normal probabilities — 571
Table 5 Student-t table, right tail — 572
Table 6 The chi-square table, right tail — 573
Table 7 F-distribution, right tail, 5% points — 574
Table 8 F-distribution, right tail, 1% points — 575
Table 9 Kolmogorov-Smirnov $D_n$ — 576


List of Symbols 


$\{\cdot\}$ a set (Section 1.1, p. 2)
$\cup, \cap$ union, intersection of sets (Section 1.1, p. 4)
$P(A)$ probability of the event $A$ (Section 1.4, p. 18)
$P(n,r)$, ${}_nP_r$ number of permutations (Section 2.2, p. 33)
$(\alpha)_k$ Pochhammer symbol (Section 2.2, p. 36)
$\binom{n}{r}$ number of combinations (Section 2.3, p. 38)
$\Sigma$ sum notation (Section 2.4, p. 41)
$\Pi$ product notation (Section 2.4, p. 41)
$P(A|B)$ conditional probability (Section 2.5, p. 49)
$\Pr\{x \le a\}$ probability of the event $\{x \le a\}$ (Section 3.1, p. 61)
$F(x)$ distribution function (Section 3.1, p. 65)
$E[\phi(x)]$ expected value (Section 4.2, p. 83)
$\mu_r'$ r-th moment about the origin (Section 4.3, p. 95)
$\mu_r$ r-th central moment (Section 4.3, p. 96)
$\mu_{[r]}$ factorial moment (Section 4.3, p. 96)
$M(t)$ moment generating function (Section 4.3.1, p. 100)
$M_f(s)$ Mellin transform of $f$ (Section 4.3.2, p. 102)
$\Gamma(\alpha)$ gamma function (Section 6.1, p. 134)
$B(\alpha, \beta)$ beta function (Section 6.4.1, p. 142)
$\sim$ distributed as (Section 6.7, p. 149)
$N(\mu, \sigma^2)$ normal or Gaussian population (Section 6.7, p. 149)
$dx \wedge dy$ wedge product of differentials (Section 7.1, p. 171)
$\mathrm{Cov}(x, y)$ covariance (Section 7.4, p. 187)
$M(t_1, \ldots, t_k)$ moment generating function (Section 7.4.2, p. 191)
$\rho$ correlation coefficient (Section 7.4.3, p. 198)
$E(x|y)$ conditional expectation (Section 7.5, p. 201)
$\mathrm{Cov}(X)$ covariance matrix in the vector $X$ (Section 8.2, p. 223)
$J$ Jacobian (Note 8.1, p. 235)
$dX$ wedge product of differentials in $X$ (Note 8.1, p. 235)
$|A|$ determinant of the matrix $A$ (Note 8.1, p. 234)
$\Gamma_p(\alpha)$ real matrix-variate gamma (Section 8.5, p. 243)
iid independently and identically distributed (Section 10.1, p. 259)
mgf moment generating function (Section 10.1, p. 259)
$N(0,1)$ standard normal (Section 10.2, p. 264)
$\chi^2$ chi-square variable (Section 10.2, p. 265)
$\mathrm{Var}(\cdot)$ variance of $(\cdot)$ (Note 10.4, p. 268)
$\chi^2_m(\lambda)$ non-central chi-square (Example 10.8, p. 275)
$t_n$ Student-t (Definition 10.6, p. 277)
$F_{m,n}$ F-statistic (Definition 10.7, p. 280)




transpose of X (Section 10.5, p. 285) 

positive definite sigma (Note 10.11, p. 288) 

smallest order statistic (Section 10.6, p. 293) 
largest order statistic (Section 10.6, p. 293) 

r-th order statistic (Section 10.6, p. 294) 
information in the sample (Note 11.9, p. 334) 

null hypothesis (Section 13.2, p. 383) 

alternate hypothesis (Section 13.2, p. 383) 

most powerful test (Section 13.2, p. 387) 

uniformly most powerful test (Section 13.2, p. 388) 
Kolmogorov-Smirnov statistic (Section 13.9, p. 421) 
multiple correlation coefficient (Section 14.5, p. 459) 
multiple correlation ratio (Section 14.5, p. 467) 
analysis of variance (Section 15.2, p. 499) 


1 Random phenomena 


1.1 Introduction 


In day-to-day life, we come across several situations, some of which are deterministic 
in nature but most of them are non-deterministic in nature. We will give some general 
ideas of deterministic and non-deterministic situations first, and then we will intro- 
duce a formal definition to define random quantities. 

You got up in the morning and went for a morning walk for one hour. You had
this routine and you had decided to go for a walk. The event that you went for a walk
did not just happen; it was predetermined. You got ready and went to college as
predetermined. If a normal human being consumes potassium cyanide or catches hold
of a high-voltage live electric wire, then the result is predetermined: the person dies,
and it is a sure event. Such situations or events are predetermined events. But there are
several events which cannot be determined beforehand.


1.1.1 Waiting time 


To go to college, you had to wait for a bus, which was scheduled to arrive at 8 a.m.
If the bus arrives at the stop early, it usually does not leave until after 8 a.m. because
it has to cater to the regular passengers from that stop. But the actual amount of
time that you had to wait for that bus on a particular day was not under your control
and could not be predetermined, because the bus could be late due to traffic congestion
on the way or for many other reasons. The waiting time on that day might be
5 minutes or even 20 minutes. The waiting time is then a non-deterministic or random
quantity.

Suppose that you went to a hospital for a routine medical check-up. Suppose that
the check-up consists of taking your weight and measuring your height, checking your
blood pressure, taking blood samples, waiting for the doctor's physical examination,
waiting for the result of the blood test, waiting for a chest x-ray, etc. Thus the total
waiting time $T$ consists of several waiting times for the individual items, all of which
are non-deterministic or random in nature; that is, they are variables with some sort of
randomness associated with them. Here $T$ is of the form $T = t_1 + t_2 + \cdots + t_k$, where
$t_j$ is the waiting time for the $j$-th item, such as a blood test, and there are $k$ such
items. If $t_j$, $j = 1, 2, \ldots, k$, are all random quantities, then $T$ itself is a random
quantity; these variables (durations of waiting) are not of predetermined durations. If
the check-up consisted of only three items, then the total waiting time would be of the
form $T = t_1 + t_2 + t_3$, where $t_j$, $j = 1, 2, 3$, are the individual waiting times.
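
The idea that a sum of random waiting times is itself random can be illustrated with a short simulation. The following Python sketch is only an illustration under stated assumptions: the function name `total_waiting_time`, the 20-minute upper limit and the choice of a uniform distribution for each individual waiting time are hypothetical, since the text does not specify any distribution.

```python
import random


def total_waiting_time(k, max_wait_minutes=20.0, rng=None):
    """Simulate k individual waiting times and return (components, total)."""
    if rng is None:
        rng = random.Random()
    # Assumption: each individual waiting time is uniform on [0, max_wait_minutes].
    # The text only says that each component is random, not how it is distributed.
    components = [rng.uniform(0.0, max_wait_minutes) for _ in range(k)]
    return components, sum(components)


if __name__ == "__main__":
    rng = random.Random(2017)  # fixed seed so that the run can be repeated
    for visit in range(1, 4):
        t, T = total_waiting_time(k=3, rng=rng)
        rounded = [round(x, 1) for x in t]
        print(f"visit {visit}: individual times = {rounded}, total T = {T:.1f} minutes")
```

Each simulated visit gives different individual waiting times and hence a different total, which is what is meant by saying that the total waiting time is a random quantity.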

Suppose you take your car for service. Suppose that the service consists of an oil
and filter change, checking and correcting fluid levels, etc. Here also, each component
of the waiting time is of non-deterministic duration. The waiting time for the first
childbirth for a fertile woman, from the time of cohabitation until childbirth, is a
random quantity, that is, of non-deterministic duration. The waiting time for a farmer for
the first rainfall after the beginning of the dry season is a random quantity since it
cannot be predetermined for sure.


1.1.2 Random events 


Consider the event of occurrence of the first flood in the local river during a rainy
season, where a flood is taken as the water level being above a threshold level. It is a
random event because it cannot be predetermined. The event of a lightning strike or
thunderbolt at a particular locality is a random event; it may be a rare event, but it is
random in nature. The event of a snake bite in a particular village is a random event.
If you are trying to swim across a river, and if you have done this several times and are
sure that you can successfully complete the swim, then your next attempt to swim across
is a sure event and not a random event, because there is no element of uncertainty about
the success. If you tried this before and had a few failures and a few successes, then
you will not be sure about your next attempt to swim across. Then the event of successful
completion of the next attempt becomes a random event. If you throw a stone into a pond
of water and your aim is to see whether the stone sinks in water, then the outcome is
deterministic in nature because you know the physical laws and you know that the stone
will sink in water. The outcome or the event is not random. If instead your aim is to see
at which location on the pond the stone will hit the water surface, then the location
cannot be predetermined and the outcome or the event is a random event. If you throw a
cricket ball upward, then the event that it comes down to Earth is not a random event
because you know for sure that the ball has to come down due to the pull of gravity. But
how high the ball will go before starting to come down is not predetermined, and hence it
is a random quantity.

With the above general ideas in mind, we will give a systematic definition for a 
random event and then define the chance of occurrence of a random event or the prob- 
ability of occurrence of a random event. To this end, we will start with the definition 
of a “random experiment”, then we will proceed to define events of various types.


Note 1.1. It is assumed that the students are familiar with the notations for a set and 
operations such as union, intersection and complementation on sets. Those who are 
familiar with these items may skip this note and proceed to the next section. For the 
sake of those who are not familiar or have forgotten, a brief description is given here.


(i) A set. It is a well-defined collection of objects. The objects could be anything. For
example, if the objects are the numbers 2, -1, 0 in a set A, then it is written as
A = {2, -1, 0}, that is, the objects are put within curly brackets. The order in which the
objects are written is not important. We could have written the same set in different
forms such as


A = {2, -1, 0} = {0, -1, 2} = {-1, 0, 2} (N1.1)


and so on; in fact, there are 6 different ways in which the same set A of 3 objects can
be written. The objects are called “elements” of the set and written as


2 ∈ A, -1 ∈ A, 0 ∈ A, 5 ∉ A, 10 ∉ A (N1.2)


and they are read as 2 in A or 2 is an element of A, -1 in A, 0 in A, whereas 5 not in A
or 5 is not an element of A, 10 not in A, etc. The symbol ∈ indicates “element of”. If B
is another set, say,


B = {m, θ, 8, ∗}, m ∈ B, θ ∈ B, 8 ∈ B, ∗ ∈ B (N1.3)


then B contains the elements, the Latin letter m, the Greek letter θ (theta), the number
8 and the symbol ∗. Thus the objects or the elements of a set need not always be numbers.
Some frequently used sets are the following:

N = {1, 2, 3, ...} = the set of natural numbers or positive integers;

N₀ = {0, 1, 2, ...} = the set of non-negative integers;

R = {x | -∞ < x < ∞} = the set of all real numbers, where the vertical bar “|” indicates
“such that” or “given that”. It is read as “all values of x such that minus infinity
less than x less than infinity”;

A₁ = {x | -10 ≤ x ≤ 5} = [-10, 5] = the closed interval from -10 to 5, closed means
that both end points are included;

A₂ = {x | -10 < x ≤ 5} = (-10, 5] = the semi-open (semi-closed) interval from -10 to
5, open on the left and closed on the right. Usually we use a square bracket “[” or “]”
to show where it is closed, and an open bracket “(” or “)” to show where it is open. In
this example, the left side is open or the point -10 is not included and the right side
is closed or the point 5 is included;

A₃ = {x | -10 ≤ x < 5} = [-10, 5) = the interval from -10 to 5, closed on the left and
open on the right;

A₄ = {x | -10 < x < 5} = (-10, 5) = the open interval from -10 to 5;

C = {a + ib | -∞ < a < ∞, -∞ < b < ∞, i = √-1} = the set of all complex numbers.


(ii) Equality of sets. Two sets are equal if they contain the same elements. 
For example, if A = {2,7} and B = {x,7}, then A = B if and only if x = 2. 


(iii) Null set or vacuous set or empty set. If a set has no elements, then it is called a
null set, and it is denoted by ϕ (Greek letter phi) or by a big O.

For example, consider the set of all real solutions of the equation x² = -1. We know
that this equation has no real solution but it has two imaginary solutions i and -i with
i² = (-i)² = -1. Thus the set of real solutions here is ϕ or it is an empty set. If the set
contains a single element zero, that is, {0}, it is not an empty set. It has one element.


(iv) Union of two sets. The set of all elements which belong to a set A₁ or to the set
A₂ is called the union of the sets A₁ and A₂ and it is written as A₁ ∪ A₂. The notation ∪
stands for “union”.

For example, let A = {2, -1, 8} and B = {θ, 8, 0, ∗} then


A ∪ B = {2, -1, 8, θ, 0, ∗}.


Here, 8 is common to both A and B, and hence it is an element of the union of A and
B but it is not written twice. Thus all elements which are in A or in B (or in both) will
be in A ∪ B. Some examples are the following:

For the sets N and N₀ above, N ∪ N₀ = N₀;


A = {x | -2 ≤ x ≤ 8}, B = {x | 3 ≤ x ≤ 10} ⇒ A ∪ B = {x | -2 ≤ x ≤ 10}

where the symbol ⇒ means “implies”;


A = {x | 1 ≤ x ≤ 3}, B = {x | 5 ≤ x ≤ 9}

⇒ A ∪ B = {x | 1 ≤ x ≤ 3 or 5 ≤ x ≤ 9}.


(v) Intersection of two sets. The set of all elements which are common to two sets A
and B is called the intersection of A and B and it is written as A ∩ B, where the symbol
∩ stands for “intersection”.

Thus, for the same sets A = {2, -1, 8} and B = {θ, 8, 0, ∗} above, A ∩ B = {8} or the set
containing only one element 8 because this is the only element common to both A and
B here. Some examples are the following:

For the sets N and N₀ above, N ∩ N₀ = N = the set of all natural numbers;


A = {x | 0 ≤ x ≤ 7}, B = {x | 5 ≤ x ≤ 8} ⇒ A ∩ B = {x | 5 ≤ x ≤ 7};

A = {x | -1 < x < 2}, B = {x | 4 < x < 12} ⇒ A ∩ B = ϕ


which is an empty set because there is no element common to A and B.


(vi) Subset of a set. A set C is said to be a subset of the set A if all elements of C are also
elements of A and it is written as C ⊂ A and read as C is a subset of A or C is contained
in A or A contains C.

From this definition, it is clear that for any set B, B ⊂ B. Let A = {2, -1, 8} and let
C = {8, -1}, then C ⊂ A. Let D = {8, 1, 5}, then D is not a subset of A because D contains
the element 5 which is not in A. Consider the empty set ϕ, then from the definition it
follows that


ϕ is a subset of every set; ϕ ⊂ A, ϕ ⊂ C, ϕ ⊂ D.


Consider the sets

A = {x | 2 ≤ x ≤ 8}, B = {x | 3 ≤ x ≤ 5} ⇒ B ⊂ A.


Observe that the set of all real numbers is a subset of the set of all complex numbers. 
The set of all purely imaginary numbers, {a + ib with a = 0}, is a subset of the set of all 
complex numbers. The set of all positive integers greater than 10 is a subset of the set 
of all positive integers. 


(vii) Complement of a set. Let D be a subset of a set A. Then the complement of D in
A, or that part which completes D in A, is the set of all elements in A but not in D,
and it is usually denoted by Dᶜ or D̄. We will use the notation Dᶜ.

Let A = {-3, 0, 6, 8, -11} and let D = {0, 6, -11} then Dᶜ = {-3, 8}. Note that D ⊂ A and
D ∪ Dᶜ = A and D ∩ Dᶜ = ϕ. In general,


B ⊂ A ⇒ Bᶜ ⊂ A, B ∪ Bᶜ = A and B ∩ Bᶜ = ϕ.

As another example,

A = {x | 0 < x < 8}, B = {x | 0 < x ≤ 3} ⇒ Bᶜ = {x | 3 < x < 8}


which is the complement of B in A. Consider another subset here, D = {x | 3 < x ≤ 5},
then what is the complement of D in A? Observe that D is a subset of A. Hence Dᶜ
consists of all points in A which are not in D. That is,


Dᶜ = {x | 0 < x ≤ 3, 5 < x < 8}


or the union of the pieces 0 < x ≤ 3 and 5 < x < 8.

The above are some basic details about sets and basic operations on sets. We will
be concerned about sets which are called “events”, which will be discussed next.
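
The finite examples in Note 1.1 can be checked with Python's built-in set type. The following is only an illustrative sketch, not part of the original text: the strings "theta" and "*" are stand-ins chosen here for the two non-numeric symbols in the set B, and A_full and D_part are hypothetical names for the sets A and D of the complement example.

```python
# Sets from Note 1.1 written as Python sets; order and repetition are ignored.
A = {2, -1, 8}
B = {"theta", 8, 0, "*"}      # "theta" and "*" are stand-ins for the text's symbols

same_set = {2, -1, 0} == {0, -1, 2} == {-1, 0, 2}   # the order of listing is irrelevant

union_AB = A | B              # elements in A or in B (or in both)
common_AB = A & B             # elements common to A and B

# Subset checks: C is a subset of A, D is not (5 is not in A).
C = {8, -1}
D = {8, 1, 5}
C_in_A = C <= A
D_in_A = D <= A

# Complement of D_part in A_full: elements of A_full that are not in D_part.
A_full = {-3, 0, 6, 8, -11}
D_part = {0, 6, -11}
D_complement = A_full - D_part

print(same_set, union_AB, common_AB, C_in_A, D_in_A, D_complement)
```

The interval examples (sets of real numbers) are not finite and cannot be listed this way; they would need a separate representation, for example as pairs of end points.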


Exercises 1.1 


Classify the following experiments/events as random or non-random, giving justifica- 
tions for your statements. 


1.1.1. A fruit and vegetable vendor at Palai is weighing a pumpkin, selected by a
customer, to see:

(a) the exact weight; 

(b) whether the weight is more than 5 kg; 

(c) whether the weight is between 3 and 4 kg. 


1.1.2. For the same vendor in Exercise 1.1.1, a pumpkin is picked from the pile blind- 
folded and weighed to see (a), (b), (c) in Exercise 1.1.1. 




1.1.3. A one kilometer stretch of a road in Kerala is preselected and a reporter is check- 
ing to see: 

(a) the total number of potholes on this stretch; 

(b) the total number of water-logged potholes on this stretch; 

(c) the maximum width of the potholes on this stretch; 

(d) the maximum depth of the potholes in this stretch. 


1.1.4. For the same experiment in Exercise 1.1.3, a one kilometer stretch is selected by 
some random device and not predetermined. 


1.1.5. A child is trying to jump across a puddle of water to see whether she can suc- 
cessfully do it. 


1.1.6. Abhirami is writing a multiple choice examination consisting of 10 ques- 
tions where each question is supplied with four possible answers of which one is 
the correct answer to the question. She wrote the answers for all the 10 questions, 
where: 

(a) she knew all the correct answers; 

(b) she knew five of the correct answers; 

(c) she did not know any of the correct answers. 


1.1.7. A secretary in an office is doing bulk-mailing. The experiment is to check and 
see how many pieces of mail are sent: 

(a) without putting stamps; 

(b) with an insufficient value of stamps; 

(c) with wrong address labels. 


1.1.8. A secretary is typing up a manuscript. The event of interest is whether the
average number of mistakes per page is

(a) zero; 

(b) less than 2; 

(c) greater than 10. 


1.1.9. A dandelion seed is flying around in the air with its fluffy attachment. The event 
of interest is to see: 

(a) how high the seed will fly; 

(b) when the seed will enter over your land; 

(c) where on your land the seed will settle down.


1.1.10. Sunlight hitting at a particular spot in our courtyard is suddenly interrupted 
by a cloud passing through. The event of interest is to see: 

(a) at what time the interruption occurred; 

(b) at which location on the light beam the interruption occurred. 




1.1.11. You are looking through the binoculars to see the palm-lined beautiful shore of
the other side of Vembanad Lake. Your view is interrupted by a bird flying across your
line of view. The event of interest is:

(a) at which point the view is interrupted (distance from the binoculars to the starting
position of the bird);

(b) at what time the interruption occurred; 

(c) the duration of interruption. 


1.1.12. A student leader in Kerala is pelting stones at a passenger bus. The event of 
interest is to see: 

(a) how many of his fellow students join him in pelting stones; 

(b) extent of the money value of destruction of public property; 

(c) how many passengers are hit by stones; 

(d) how many passengers lose their eyes by the stones hitting the eyes. 


1.2 A random experiment


Randomness is associated with the possible outcomes in a random experiment, not 
in the conduct of the experiment itself. Suppose that you consulted a soothsayer or 
“kaniyan” and conducted an experiment as per the auspicious time predicted by the
kaniyan; the experiment does not become a random experiment. Suppose that you 
threw a coin. If the head (one of the sides of the coin) comes up, then you will con- 
duct the experiment, otherwise not. Still the experiment is not a random experiment. 
Randomness is associated with the possible outcomes in your experiment. 


Definition 1.1 (Random experiment). A random experiment is such that the possi- 
ble outcomes of interest, or the items that you are looking for, are not deterministic 
in nature or cannot be predetermined. 


(i) Suppose that the experiment is that you jump off of a cliff to see whether you 
fly out in the north, east, west, south, up or down directions. In this experiment, we 
know the physical laws governing the outcome. There are not six possible directions in 
which you fly, but for sure there is only one direction, that is, going down. The outcome 
is predetermined and the experiment is not a random experiment. 

(ii) If you put a 5-rupee coin into a bucket of water to see whether it sinks or not, 
there are not two possibilities that either it sinks or it does not sink. From the physi- 
cal laws, we know that the coin sinks in water and it is a sure event. The outcome is 
predetermined. 

(iii) If you throw a coin upward to see whether one side (call it heads) or the other
side (call it tails) will turn up when the coin falls to the floor, assuming that the coin
will not stay on its edge when it falls to the floor, then the outcome is not predetermined.
You do not know for sure whether heads = {H} will turn up or tails = {T} will
turn up. These are the only possible outcomes here, heads or tails, and these outcomes
are not predetermined. This is a random experiment.

(iv) A child played with a pair of scissors and cut a string of length 20 cm. If the 
experiment is to see whether the child cut the string, then it is not a random experi- 
ment because the child has already cut the string. If the experiment is to see at which 
point on the string the cut is made, then it is a random experiment because the point 
is not determined beforehand. 


The set of possible outcomes in throwing a coin once, or the outcome set, denoted
by S, is then given by S = {H, T} = {T, H}.


Definition 1.2 (A sample space or outcome set). The set of all possible outcomes
in a random experiment, when no outcome there allows a subdivision in any sense,
is called the sample space S or the outcome set S for that experiment.


The sample space for the random experiment of throwing a coin once is given by
S = {H, T} = {T, H}.


If the coin is tossed twice, then the possible outcomes are H or T in the first trial and
H or T in the second trial. Then


S = {(H, H), (H, T), (T, H), (T, T)} (1.1)


where, for example, the point (H, T) indicates heads in the first trial and tails in the
second trial. In this experiment of throwing a coin twice, suppose someone says that
the sample space is


S₁ = {two heads, one head, zero heads} = {zero tails, one tail, two tails}.


Is S or S₁ the sample space here? Note that one tail or one head means that there are
two possibilities: heads first and tails next, or tails in the first trial and heads in the
second trial. Thus the point “one head” or “one tail” allows a subdivision into two
points (H, T) and (T, H), and in both of these points there is exactly one head and
exactly one tail. Hence S₁ allows one of its elements to be subdivided into two possible
elements. Therefore, we will not take S₁ as the sample space here; S in (1.1) is the
sample space here.
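
The distinction between S and the coarser description by number of heads can be sketched in Python as follows. This is a minimal illustration under the assumption that each outcome is stored as an ordered tuple of the characters "H" and "T"; the name by_heads is hypothetical.

```python
from itertools import product

# Sample space for tossing a coin twice: all ordered pairs of H and T.
S = set(product("HT", repeat=2))

# Group the sample points by the number of heads they contain.
by_heads = {}
for point in S:
    by_heads.setdefault(point.count("H"), set()).add(point)

print(sorted(S))
print(by_heads[1])   # the description "one head" covers two distinct sample points
```

Because the group for "one head" contains two sample points, that description can be subdivided, which is why it is not used as an element of the sample space.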


Example 1.1. Construct the sample space of rolling a die once [A die is a cube with
the faces marked with the natural numbers 1,2,3,4,5,6] and mark the subsets A₁ = the
number is even, A₂ = the number is greater than or equal to 3.


Solution 1.1. The possibilities are that one of the numbers 1,2,3,4,5,6 will turn up.
Hence

S = {1, 2, 3, 4, 5, 6}

and the subsets are A₁ = {2, 4, 6} and A₂ = {3, 4, 5, 6}.


Note 1.2. In S, the numbers need not be taken in the natural order. They could have
been written in any order, for example, S = {4, 3, 1, 2, 5, 6} = {6, 5, 2, 1, 3, 4}. Similarly,
A₁ = {4, 2, 6} = {6, 2, 4}. In fact, S can be represented in 6! = (1)(2)(3)(4)(5)(6) = 720 ways.
Similarly, A₂ can be written in 4! = 24 ways and A₁ in 3! = 6 ways, because in a set, the
order in which the elements appear is unimportant.
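
The counts in Note 1.2 can be verified by brute force. The sketch below, which is an illustration and not part of the original text, builds the two events of Example 1.1 from their verbal descriptions and counts the possible listing orders by enumerating permutations.

```python
from itertools import permutations
from math import factorial

S = {1, 2, 3, 4, 5, 6}
A1 = {x for x in S if x % 2 == 0}     # the number is even
A2 = {x for x in S if x >= 3}         # the number is greater than or equal to 3

# Each way of writing a set down is one ordering (permutation) of its elements.
orderings_S = len(list(permutations(S)))
orderings_A1 = len(list(permutations(A1)))
orderings_A2 = len(list(permutations(A2)))

print(orderings_S, orderings_A1, orderings_A2)
print(factorial(6), factorial(3), factorial(4))
```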


Definition 1.3 (An event or a random event). Any subset A of the sample space S 
of a random experiment is called an event or a random event. Hereafter, when we 
refer to an event, we mean a random event defined in a sample space or a subset of 
a sample space. 


Definition 1.4 (Elementary events). If a sample space consists of n individual elements,
such as the example of throwing a coin twice where there are 4 elements or 4
points in the sample space S, the singleton elements in S are called the elementary
events.


Let A be an event in the sample space S, then A ⊂ S, that is, A is a subset of S or
all elements in A are also elements of S. If A and B are two subsets in S, that is,


A ⊂ S and B ⊂ S, then A ∪ B ⊂ S


is called the event of occurrence of either A or B (or both) or the occurrence of at least
one of A and B. A ∪ B means the set of all elements in A or in B (or in both). Also A ∩ B
is called the simultaneous occurrence of A and B. Intersection of A and B means the
set of all elements common to A and B. Here, ∪ stands for “union” and ∩ stands for
“intersection”, A ⊂ S stands for A is contained in S or S contains A:


A ∪ B = occurrence of at least one of A and B.


A ∩ B = simultaneous occurrence of A and B.

Thus, symbolically, ∪ stands for “either, or” and ∩ stands for “and”.


Example 1.2. In a random experiment of rolling a die twice, construct: 
(1) the sample space S, and identify the events; 

(2) A = event of rolling 8 (sum of the face numbers is 8); 

(3) B = event of getting the sum greater than 10. 


Solution 1.2. Here, in the first trial, one of the six numbers can come, and in the
second trial also, one of the six numbers can come. Hence the sample space consists of
all ordered pairs of numbers from 1 to 6. That is,

S = {(1, 1), (1, 2), ..., (1, 6), (2, 1), ..., (6, 6)}.


There are 36 points in S. A is the event of rolling 8. This can come by having 2 in the
first trial and 6 in the second trial or 3 in the first trial and 5 in the second trial and so
on. That is,


A = {(2, 6), (3, 5), (4, 4), (5, 3), (6, 2)}.

The event of getting the sum greater than 10 means the sum is 11 or 12. Therefore,

B = {(5, 6), (6, 5), (6, 6)}.

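
The sample space and the two events of Example 1.2 can be enumerated directly. The following sketch is illustrative only; it assumes that each outcome is stored as an ordered pair (first roll, second roll).

```python
from itertools import product

faces = range(1, 7)
S = set(product(faces, repeat=2))        # ordered pairs (first roll, second roll)

A = {p for p in S if sum(p) == 8}        # event of rolling 8
B = {p for p in S if sum(p) > 10}        # sum greater than 10, that is, 11 or 12

print(len(S), sorted(A), sorted(B))
```
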

Example 1.3. In the experiment of throwing a coin twice, construct the events of
getting:
(1) exactly two heads;

(2) at least one head;

(3) at least one tail, and interpret their unions and intersections.


Solution 1.3. Let A = event of getting exactly two heads, B = event of getting at least
one head, and C = event of getting at least one tail. Here, the sample space


S = {(H, H), (H, T), (T, H), (T, T)}.

Then


A = {(H, H)}, B = {(H, T), (T, H), (H, H)},
C = {(T, H), (H, T), (T, T)}.


At least one head means exactly one head or exactly two heads (one or more heads),
and similar interpretation for C also. [The phrase “at most” means that number or
less, that is, at most 1 head means 1 head or zero heads.] A ∪ B = B since A is contained
in B here. Thus occurrence of A or B or both here means the occurrence of B itself
because occurrence of exactly 2 heads or at least one head implies the occurrence of
at least one head. A ∪ C = {(H, H), (H, T), (T, H), (T, T)} = occurrence of exactly 2 heads
or at least one tail (or both), which covers the whole sample space or which is sure to
occur, and hence S can be called the sure event. A ∩ B = {(H, H)} because this is the
common element between A and B. The simultaneous occurrence of exactly 2 heads
and at least one head is the same as saying the occurrence of exactly 2 heads. A ∩ C =
null set. There is no element common to A and C. A null set is usually denoted by a
big O or by the Greek letter ϕ (phi). Here, A ∩ C = ϕ. Also observe that it is impossible
to have the simultaneous occurrence of exactly 2 heads and at least one tail because
there is nothing common here. Thus the null set ϕ can be interpreted as the impossible
event. Note that, by definition,

A ∪ B = B ∪ A, A ∩ B = B ∩ A, A ∪ C = C ∪ A, A ∩ C = C ∩ A.

Now,

B ∪ C = B or C (or both) = {(H, T), (T, H), (H, H), (T, T)} = S


which is sure to happen because the event of getting at least one head or at least one
tail (or both) will cover the whole sample space:


B ∩ C = {(H, T), (T, H)} =


event of getting exactly one head = event of getting exactly one tail, since the common
part is the occurrence of exactly one head or exactly one tail, which then is the
intersection of B and C.


Also singleton elements are called elementary events in a sample space. Thus in
Example 1.3, there are 4 elementary events in S. Also, the non-occurrence of an event
A is denoted by Ā or Aᶜ. We will use the notation Aᶜ to denote the non-occurrence of A.
In Example 1.3, if A is the event of getting exactly 2 heads, then Aᶜ will be the event of
the non-occurrence of A, which means the occurrence of exactly one head or exactly
zero heads (or exactly 2 tails). Thus Aᶜ = {(H, T), (T, H), (T, T)} where A = {(H, H)}.
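
The unions, intersections and complement discussed for Example 1.3 can be checked mechanically. This is a minimal sketch, assuming outcomes are stored as ordered tuples of "H" and "T"; the variable names are chosen here for illustration.

```python
from itertools import product

S = set(product("HT", repeat=2))                 # sample space for two tosses

A = {p for p in S if p.count("H") == 2}          # exactly two heads
B = {p for p in S if p.count("H") >= 1}          # at least one head
C = {p for p in S if p.count("T") >= 1}          # at least one tail

A_or_B = A | B          # equals B, since A is contained in B
A_or_C = A | C          # equals S, the sure event
A_and_B = A & B         # equals A
A_and_C = A & C         # empty set, the impossible event
B_and_C = B & C         # exactly one head (equivalently, exactly one tail)
A_complement = S - A    # non-occurrence of A

print(A_or_B == B, A_or_C == S, A_and_C == set(), sorted(B_and_C), sorted(A_complement))
```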


Notation 1.1. Aᶜ = non-occurrence of the event A when A and Aᶜ are in the same sample
space S.

Note that if A and B are two events in the same sample space S, then if A ∩ B = ϕ,
this means that they cannot occur simultaneously or the occurrence of A excludes the
occurrence of B and vice versa. Then A and B will be called mutually exclusive events.


A ∩ B = ϕ ⇒ A and B are mutually exclusive or the occurrence of A excludes the
occurrence of B and vice versa.


In the light of the above discussion, we have the following general results and
interpretations for events in the same sample space:


S = sample space = sure event (i)

ϕ = null set = impossible event (ii)


[Note that by assumption a null set is a subset of every set.]


A ∪ B = occurrence of A or B or both (at least one of A and B) (iii)
A ∩ B = simultaneous occurrence of A and B (iv)
A ∩ B = ϕ means A and B are mutually exclusive (v)
A ∪ B = S means that A and B are totally exhaustive events (vi)
Aᶜ = complement of A in S = non-occurrence of the event A. (vii)


Single elements in S are called elementary events if S consists of a finite number of
distinct elements.


Note 1.3. In the example of throwing a coin once, suppose that a physicist is capable
of computing the position of the coin and the amount of pressure applied when it was
thrown, all the forces acting on the coin while it is in the air, etc., then the physicist
may be able to tell exactly whether that throw will result in a head or tail for sure.
In that case, the outcome is predetermined. Hence one can argue that an experiment
becomes random only due to our lack of knowledge about the various factors affecting
the outcome. Also note that we do not have to really throw a coin for the description of
our random experiment to hold. We are only looking at the possible outcomes “if” a
coin is thrown. After it is thrown, we already know the outcome. In the past, a farmer
used to watch the nature of the cloud formation, the coolness in the wind, the direction
of the wind, etc. to predict whether rain is going to come on that day. His prediction
might be wrong 70% of the time. Nowadays, a meteorologist can predict, at least in
the temperate zones, the arrival of rain including the exact time and the amount of
rainfall, very accurately at least one to two days beforehand. The meteorologist may be
wrong less than 1% of the time. Thus, as we know more and more about the factors
affecting an event, we are able to predict its occurrence more and more accurately, and
eventually possibly exactly. In the light of the above details, is there anything called a
random experiment?


Before concluding this section, let us examine one more point. The impossible
event ϕ is often misinterpreted. Suppose that a monkey is given a computer to play
with. Assume that it does not know any typing but is only playing with the keyboard with
the English alphabet. Consider the event that the monkey's final creation is one of the
speeches of President Bush word-by-word. This is not a logically impossible event and
we will not denote this event by ϕ. We use ϕ for logically impossible events. The event
that the monkey created one of the speeches is almost surely impossible. Such events
are called almost surely impossible events. Consider the event of the best student in this
class passing the next test. We are almost sure that she will pass but only due to some
unpredicted mishap she may not pass. This is almost surely a sure event but not logically
a sure event, and hence this cannot be denoted by our symbol S for a sure event.


Exercises 1.2 


1.2.1. Write down the outcome set or the sample space in the following experiments: 
(a) A coin is tossed 2 times (assume that only head = H or tail = T can turn up);

(b) Two coins together are tossed once; 

(c) Two coins together are tossed two times. 


1.2.2. Write down the outcome set or sample space in the following experiments:

(a) A die is rolled two times;

(b) Two dice are rolled together once;

(c) Two dice together are rolled two times.


1.2.3. Count the number of elementary events or sample points in the sample spaces 
of the following experiments, if possible: 

(a) A coin is tossed 20 times; 

(b) Two coins together are tossed 15 times; 

(c) A die is rolled 3 times; 

(d) Two dice together are rolled 5 times.


1.2.4. Write down the sample space in the following experiments and count the number
of sample points, if possible:

(a) A random cut is made on a string of a length of 20 cm (one end is marked zero and
the other end 20 and let x be the distance from zero to the point of cut);

(b) Two random cuts are made on a string of a length of 20 cm and let x and y be the
distances from zero to the points of cut, respectively.


1.2.5. There are 4 identical tags numbered 1,2,3,4. These are put in a box and well
shuffled and tags are taken blind-folded. Write down the sample spaces in the following
situations, and count the number of sample points in each case, if possible:

(a) One tag is taken from the box; 

(b) One tag is taken. Its number is noted and then returned to the box, again shuffled 
and a second tag is taken (this is called sampling with replacement); 

(c) In (b) above, the first tag is kept aside, not returned to the box, then a second tag 
is taken after shuffling (this is called sampling without replacement); 

(d) Two tags are taken together in one draw. 


1.2.6. Write down the sample space in the following cases when cards are taken, 
blind-folded, from a well-shuffled deck of 52 playing cards, and count the number of 
sample points in each case, if possible: 

(a) One card is taken; 

(b) Two cards are taken at random with replacement; 

(c) Two cards are taken at random without replacement; 

(d) Two cards together are taken in one draw. 


1.2.7. Construct the sample spaces in the following experiments: 
(a) Checking the life-time x of one electric bulb; 
(b) Checking the life-times x and y of two electric bulbs. 


1.2.8. In Exercise 1.2.1 (b), write down the following events, that is, write down the 
corresponding subsets of the sample space: 




(a) A = the event of getting at most one tail;

(b) B = the event of getting exactly 3 tails;

(c) C = the event of getting at most one head;

(d) Interpret the events (i): A ∩ B, (ii): A ∪ B, (iii): A ∩ C, (iv): A ∪ C, (v): Aᶜ,
(vi): (A ∪ C)ᶜ, (vii): (A ∩ C)ᶜ.


1.2.9. In Exercise 1.2.2 (b), write down the following events: 

(a) A = the number in the first trial is bigger than or equal to the number in the second 
trial; 

(b) B = the sum of the numbers in the two trials is (i): bigger than 12, (ii): between 8
and 10 (both inclusive), (iii): less than 2.


1.2.10. In Exercise 1.2.4 (a), write down the following events and give graphical
representations:

(a) A = event that the smaller portion is less than 5 cm; 

(b) B = event that the smaller portion is between 5 and 10 cm; 

(c) C = the event that the smaller portion is less than 3 of the larger portion. 


1.2.11. In Exercise 1.2.4 (b), write down the following events and give graphical repre- 
sentations: 

(a) x < y;

(b) x + y < 10;

(c) x² ≤ y²;

(d) xy ≤ 5.


1.2.12. In Exercise 1.2.5 (b), write down the following events: 
(a) A = event that the first number is less than or equal to the second number; 
(b) B = event that the first number is less than or equal to I of the second number. 


1.2.13. In Exercise 1.2.5 (c), write down the same events A and B in Exercise 1.2.12. 


1.3 Venn diagrams 


We will borrow the graphical representation of sets as Venn diagrams from set theory. 
In a Venn diagrammatic representation, a set is represented by a closed curve, usually 
a rectangle or circle or ellipse, and subsets are represented by closed curves within 
the set or by points within the set or by regions within the set, see Figures 1.1, 1.2, 1.3. 

Examples are the following: A ⊂ S, B ⊂ S, C ⊂ S, D ⊂ S, E ⊂ S are all events in the same
sample space S. In the Venn diagram in Figure 1.2, A and B intersect, A and C intersect,
B and C intersect, A, B, C all intersect and D and E do not intersect. A ∩ B is the shaded
region, also A ∩ B ∩ C is the shaded region. D ∩ E = ϕ or they are mutually exclusive.


Figure 1.1: Venn diagrams for sample space.


Figure 1.2: Representations of events.


Figure 1.3: Union, intersection, complement of events.


By looking at the Venn diagram, we can see the following general properties. A ∪ B can
be split into three mutually exclusive regions A ∩ Bᶜ, A ∩ B, Aᶜ ∩ B. That is,


A ∪ B = (A ∩ Bᶜ) ∪ (A ∩ B) ∪ (Aᶜ ∩ B)
      = A ∪ (Aᶜ ∩ B) = B ∪ (Bᶜ ∩ A) (1.2)


where


(A ∩ Bᶜ) ∩ (A ∩ B) = ϕ, (A ∩ Bᶜ) ∩ (Aᶜ ∩ B) = ϕ,
(A ∩ B) ∩ (Aᶜ ∩ B) = ϕ, (1.3)

A ∩ (Aᶜ ∩ B) = ϕ,
B ∩ (Bᶜ ∩ A) = ϕ (1.4)


or they are all mutually exclusive events. Also note that for any event A,

A ∪ ϕ = A, A ∩ ϕ = ϕ (1.5)


which also means that the sure event S and the impossible event ϕ are mutually
exclusive events as well as totally exhaustive events.
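
The identities (1.2) to (1.5) can be checked on a small example. The sample space and the events below are hypothetical choices made only for illustration; a check on one example is of course not a proof.

```python
S = set(range(1, 11))          # hypothetical sample space {1, ..., 10}
A = {1, 2, 3, 4, 5}            # hypothetical event
B = {4, 5, 6, 7}               # hypothetical event overlapping A

Ac, Bc = S - A, S - B          # complements in S
pieces = [A & Bc, A & B, Ac & B]

# (1.2): the three pieces together make up A ∪ B.
union_of_pieces = set().union(*pieces)

# (1.3): the three pieces are pairwise mutually exclusive.
pairwise_disjoint = all(
    not (pieces[i] & pieces[j])
    for i in range(len(pieces))
    for j in range(i + 1, len(pieces))
)

empty = set()                  # plays the role of the null set
print(union_of_pieces == A | B, pairwise_disjoint, A | empty == A, A & empty == empty)
```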




Consider the set of events A₁, ..., Aₖ in the same sample space S, that is, Aⱼ ⊂ S,
j = 1, 2, ..., k, where k could be infinity also or there may be a countably infinite number
of events. Let these events be mutually exclusive and totally exhaustive. That is,


A₁ ∩ A₂ = ϕ, A₁ ∩ A₃ = ϕ, ..., A₁ ∩ Aₖ = ϕ,
A₂ ∩ A₃ = ϕ, ..., A₂ ∩ Aₖ = ϕ, ..., Aₖ₋₁ ∩ Aₖ = ϕ and
S = A₁ ∪ A₂ ∪ ··· ∪ Aₖ.


This can also be written as follows:

Aᵢ ∩ Aⱼ = ϕ, for all i ≠ j, i, j = 1, 2, ..., k, A₁ ∪ A₂ ∪ ··· ∪ Aₖ = S.


Then we say that the sample space S is partitioned into k mutually exclusive and totally 
exhaustive events. We may represent this as the following Venn diagram in Figure 1.4. 


Figure 1.4: Partitioning of a sample space. 


If B is any other event in S where S is partitioned into mutually exclusive events
A₁, A₂, ..., Aₖ, then note from the Venn diagram that B can be written as the union of
mutually exclusive portions B ∩ A₁, B ∩ A₂, ..., B ∩ Aₖ, where some of the portions may
be null, that is,


B = (B ∩ A₁) ∪ (B ∩ A₂) ∪ ··· ∪ (B ∩ Aₖ)

with


(B ∩ A₁) ∩ (B ∩ A₂) = ϕ, ..., (B ∩ Aₖ₋₁) ∩ (B ∩ Aₖ) = ϕ. (1.6)
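
The decomposition in (1.6) can be illustrated with a finite example. The sample space, the partition and the event B below are hypothetical, and the helper is_partition is a name introduced here only for this sketch.

```python
S = set(range(1, 13))                                      # hypothetical sample space
parts = [{1, 2, 3}, {4, 5, 6, 7}, {8, 9}, {10, 11, 12}]    # hypothetical A_1, ..., A_4


def is_partition(parts, S):
    """True if the parts are pairwise mutually exclusive and their union is S."""
    disjoint = all(
        not (parts[i] & parts[j])
        for i in range(len(parts))
        for j in range(i + 1, len(parts))
    )
    return disjoint and set().union(*parts) == S


B = {2, 3, 4, 10}                                          # any other event in S
portions = [B & Aj for Aj in parts]                        # B ∩ A_j, some may be null

print(is_partition(parts, S))
print(portions)
print(set().union(*portions) == B)
```

Here the portion of B inside the third part is null, which matches the remark that some of the portions may be empty.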


Exercises 1.3 


1.3.1. Consider the sample space S when a coin is tossed twice: 
S = ((H, H), (H, T), (T, H), (T, T)]. 


Indicate the correct statements from the following list of statements: 
(a) (Η,Η) ε5; 


1.3 Venn diagrams —— 17 


(b) {(H,H)} € S; 
(c) (UG, H)) c S; 
(d) (GI, T), (T, H)] ς 5; 
(ο) (H,T),(T,H)) e S. 


1.3.2. Let S = {x | Ox x x 10}. List the correct statements from the following list of state- 
ments: 

(a) 2€S; 

(b) {2} €S; 

(c) A={x|2<x<5}eS; 

(d) A={x|2<x<5}cS. 


1.3.3. For the same sample space S in Exercise 1.3.1, let A = ((H, H)}, B= {(H, H), (H, T), 
(T, H)), C = {(T, T)). Draw one Venn diagram each to illustrate the following: (1) A and 
A‘; (2) AUB, ANB, AUBUC, An BnC. 


1.3.4. By using the definition of sets, prove that the following statements hold with 
reference to a sample space S where A, B,C are events in S: 

(a) AUBUC=(AUB)UC=AU (BUC); 

(b ANBNC=(ANB)NC=AN(BNC). 


1.3.5. For the same sample space S and events A, B,C in Exercise 1.3.3, verify the fol- 
lowing results with the help of Venn diagrams and interpret each of the events: 

(a) AU(BNC)=(AUB)n (AUC); 

(b) An(BuC)-(AnB)u(AnC) 

(c) (BUC) - B*nC*; 

(d) (ANB) =A UBS. 


1.3.6. By constructing an example of your own, verify the fact that for three events in 
the same sample space S: 

(a) AUB- AUC need not imply that B= C; 

(b) An B-AnC need not imply that B = C. 


1.3.7. For events A, B, C in the same sample space S, prove the following results: 
(a) $(A \cup B)^c = A^c \cap B^c$;

(b) $(A \cap B)^c = A^c \cup B^c$;

(c) $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$;

(d) $A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$.


1.3.8. For events $A_1, A_2, \ldots$ in the same sample space S, prove the following results:
(a) $(A_1 \cup A_2 \cup \cdots)^c = A_1^c \cap A_2^c \cap \cdots$;
(b) $(A_1 \cap A_2 \cap \cdots)^c = A_1^c \cup A_2^c \cup \cdots$.




1.4 Probability or chance of occurrence of a random event 


Now, we shall try to assign a numerical value for the chance of occurrence of a random
event in a well-defined sample space S. Let P(A) denote the probability or chance of
occurrence of the event $A \subset S$.


Notation 1.2. P(A): the probability of the event A.
We are trying to give a meaning to the following types of statements that you hear
every day:
(1) There is a 95% chance of rain today $\Rightarrow P(A) = 0.95$, where the symbol $\Rightarrow$ stands for
“implies” and A is the event of having rain.
(2) The chance of winning a certain lottery is one in a million $\Rightarrow P(A) = \frac{1}{1000000}$.
(3) The chance of a flood on campus today is nearly zero $\Rightarrow P(A) \approx 0$.
(4) The chance that Miss Cute will win the beauty contest is greater than the chance that
Miss Wise wins $\Rightarrow P(A) > P(B)$, where A is the event that Miss Cute wins and B is the event
that Miss Wise wins.


We can define this probability by using some postulates or axioms. Postulates or
axioms are logically consistent and non-overlapping basic assumptions that we make
to define something. Usually such postulates are chosen by taking into consideration
plausible properties that we would like the item being defined to have.
There is no question of proving or disproving these basic assumptions. The following
three postulates will be used to define P(A), the probability or chance of occurrence of
the event A.


1.4.1 Postulates for probability 


(i) $0 \le P(A) \le 1$, or the probability of an event is a number between 0 and 1, both
inclusive;

(ii) $P(S) = 1$, or the probability of the sure event is 1;

(iii) $P(A_1 \cup A_2 \cup \cdots) = P(A_1) + P(A_2) + \cdots$ whenever $A_1, A_2, \ldots$ are mutually exclusive
[the events may be finite or countably infinite in number].


Thus $P(\cdot)$ coming out of the above three axioms will be called the probability of the
event $(\cdot)$. Let us see what that quantity will be in the simple example of throwing a
coin once.


Example 1.4. Compute the probability of getting a head when a coin is tossed once. 


Solution 1.4. We have already computed the sample space for this experiment: 


S = {H, T}. 




For convenience, let A be the event of getting a head, so that A = {H}, and let B be the
event of getting a tail, that is, B = {T}. When we get a head, we cannot get a tail at the
same time, and hence A and B are mutually exclusive events or $A \cap B = \phi$. In a trial,
one of the events A or B must occur, that is, either a head or a tail will turn up because
we have ruled out the possibility that the coin will fall on its edge. Thus
we have

$$A \cap B = \phi \quad\text{and}\quad A \cup B = S.$$

From postulate (ii),

$$1 = P(S) = P(A \cup B).$$

Now from postulate (iii),

$$P(A \cup B) = P(A) + P(B)$$

since $A \cap B = \phi$. Therefore,

$$1 = P(A) + P(B) \;\Rightarrow\; P(A) = 1 - P(B).$$


We can only come up to this line and cannot proceed further. The above statement
does not imply that $P(A) = \frac{1}{2}$. There are infinitely many values that P(A) can take so that
the equation 1 = P(A) + P(B) is satisfied. In other words, by using the postulates alone we
cannot compute P(A) even for this simple example of throwing a coin once.
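The point that the postulates alone do not fix P(A) can be illustrated numerically. The sketch below is only an illustration: the function name `satisfies_postulates` and the list of candidate values are assumptions made here, and the check is specific to the two-point sample space S = {H, T}.

```python
from fractions import Fraction

def satisfies_postulates(p_head, p_tail):
    """Check postulates (i)-(iii) for the sample space S = {H, T}."""
    # Postulate (i): each probability lies between 0 and 1, both inclusive.
    in_range = 0 <= p_head <= 1 and 0 <= p_tail <= 1
    # Postulates (ii) and (iii): 1 = P(S) = P(A) + P(B).
    total_is_one = p_head + p_tail == 1
    return in_range and total_is_one

# Several candidate values for P(A); P(B) is then 1 - P(A).
candidates = [Fraction(1, 2), Fraction(1, 10), Fraction(9, 10), Fraction(1)]
valid = [p for p in candidates if satisfies_postulates(p, 1 - p)]

print(valid)
```

Every candidate passes the check, so the three postulates by themselves cannot single out the value 1/2.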


Example 1.5. Compute the probability of getting a sum bigger than 10 when a die is 
rolled twice. 


Solution 1.5. Here, the sample space consists of 36 elementary events as seen before.
Let A be the event of getting a sum bigger than 10, which means a sum of 11 or 12. Then
$A = \{(5,6), (6,5), (6,6)\}$. Let B be the event of getting a sum less than or equal to 10, that is,
the complement of A, or $B = A^c$. But for any event A, $A \cup A^c = S$ and $A \cap A^c = \phi$, or these
two are mutually exclusive and totally exhaustive events. Therefore, by postulates (ii)
and (iii) we can come up to the stage:


$$1 = P(A) + P(B).$$


We know that the event A has 3 of the 36 elementary events in the sample space S.
But we cannot jump to the conclusion that therefore $P(A) = \frac{3}{36}$, because the
definition of probability, as given by the postulates, does not depend upon the number
of sample points or elementary events favorable to the event. If someone draws such
a conclusion and writes the probability as $\frac{3}{36}$ in the above case, it will be wrong, which
may be seen from the following considerations:




(i) A housewife is drying clothes on the terrace of a high-rise apartment building.
What is the probability that she will jump off the building? There are only two
possibilities: either she jumps off or she does not jump off. Therefore, if you say that the
probability is $\frac{1}{2}$, you are obviously wrong. The chance is practically nil that she will
jump off the building.

(ii) What is the probability that there will be a flash flood on this campus in the
next 5 minutes? There are two possibilities: either there will be a flash flood or there will
not be a flash flood. If you conclude that the probability is therefore $\frac{1}{2}$, you are definitely
wrong because we know that the chance of a flash flood here in the next 5 minutes is
practically nil.

(iii) At this place, tomorrow can be a sunny day, or a cloudy day, or a mixed sunny
and cloudy day, or a rainy day. What is the probability that tomorrow will be a rainy
day? If you say that the probability is $\frac{1}{4}$ since we have identified four possibilities,
you can be wrong because these four possibilities need not have the same
probabilities. Since today is sunny and since it is not a rainy season, most probably
tomorrow will also be a sunny day.

(iv) A child cuts a string of 40 cm in length into two pieces while playing with a
pair of scissors. Let one end of the string be marked as 0 and the other end as 40.
What is the probability that the point of the cut is in the sector from 0 to 8 cm? Here,
the number of sample points is infinite, not even countable. Hence, even by misusing the
idea that the probability is the ratio of the number of sample points favorable to the event to
the total number of sample points, we cannot come up with an answer, not even a
wrong one as in the previous examples. Neither the total number of points nor the number of points
favorable to the event can be counted. Then how do we calculate this probability?

(v) A person is throwing a dart at a square board of 100 cm in length and width.
What is the probability that the dart will hit a particular 10 cm × 10 cm region on the
board? Here also, we cannot count the number of sample points, not even to misuse the
numbers to come up with an answer. Then how do we compute this probability?

(vi) A floor is paved with square tiles of length m units. A circular coin of
diameter d units is thrown upward, where d < m. What is the probability that the coin will
fall clean within a tile, not cutting its edges or corners? This is the famous Buffon's
“clean tile problem” from which the theory of probability has its beginning. How do
we answer these types of questions (definitely not by counting the sample points)?


From the above examples, it is clear that the probability of an event does not de- 
pend upon the number of elementary events in the sample space and the number of 
elementary events favorable to the event. It depends upon many factors. Also we have 
seen that our definition of probability, through the three axioms, does not help us to 
evaluate the probability of a given event. In other words, the theory is useless, when 
it comes to the problem of computing the probability of a given event or the theory 
is not applicable to a practical situation, unless we introduce more assumptions or 
extraneous considerations. 




Before we introduce some rules, we can establish some general results by using 
the axioms for a probability. 


Result 1.1. The probability of an impossible event is zero, that is, $P(\phi) = 0$.


Proof 1.1. Consider the sure event S and the impossible event $\phi$. From the definitions,

$$S \cup \phi = S \quad\text{and}\quad S \cap \phi = \phi.$$

Hence from postulates (ii) and (iii):

$$1 = P(S) = P(S \cup \phi) = P(S) + P(\phi) = 1 + P(\phi)$$

since $S \cap \phi = \phi$. But probability is a real number. Therefore,

$$1 = 1 + P(\phi) \;\Rightarrow\; P(\phi) = 0.$$


Result 1.2. Probability of non-occurrence = 1 − probability of occurrence, or
$P(A^c) = 1 - P(A)$.


Proof 1.2. We note that $A$ and $A^c$ are mutually exclusive and totally exhaustive events,
and hence from axioms (ii) and (iii) we have


$$S = A \cup A^c \text{ and } A \cap A^c = \phi \;\Rightarrow\; 1 = P(A) + P(A^c) \;\Rightarrow\; P(A^c) = 1 - P(A).$$


For example, if 0.8 is the probability that Abhirami will pass the next class test then 
0.2 is the probability that she may not pass the next class test. If 0.6 is the probability 
that Abhirami may be the top scorer in the next class test, then 0.4 is the probability 
that she may not be the top scorer in the next class test. 


Result 1.3. For any two events A and B in the same sample space S, 


$$P(A \cup B) = P(A) + P(B) - P(A \cap B).$$


Proof 1.3. From equation (1.2) or from the corresponding Venn diagram, we have seen
that $A \cup B$ can be written as the union of three mutually exclusive events $A \cap B^c$, $A \cap B$
and $A^c \cap B$. Hence from axiom (iii),


$$P(A \cup B) = P(A \cap B^c) + P(A \cap B) + P(A^c \cap B). \tag{a}$$

Also,

$$P(A) = P(A \cap B^c) + P(A \cap B)$$

since

$$(A \cap B^c) \cap (A \cap B) = \phi \;\Rightarrow\; P(A \cap B^c) = P(A) - P(A \cap B). \tag{b}$$




Similarly,

$$P(B) = P(B \cap A^c) + P(A \cap B) \;\Rightarrow\; P(B \cap A^c) = P(B) - P(A \cap B). \tag{c}$$

Substituting (b) and (c) in (a), we have

$$P(A \cup B) = P(A) - P(A \cap B) + P(B) - P(A \cap B) + P(A \cap B) = P(A) + P(B) - P(A \cap B).$$


This completes the proof. The result can also be guessed from the Venn diagrammatic
representation, taking probability as some sort of measure over the regions. The measure
over the region A plus the measure over the region B counts the region $A \cap B$ twice,
and hence it should be subtracted once. This is not a proof, but the result can be
guessed in this way from the Venn diagram. The above result can be extended to a set of three
events or to a set of k events, k = 2, 3, ....


Result 1.4. Let A, B and C be three events in the same sample space S. Then


$$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C).$$


Proof 1.4. The proof follows parallel to the steps in the proof of Result 1.3, and
hence it is left to the students. Also as an exercise, write down the formula for the
probability of the union of k events $A_1, \ldots, A_k$.
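Results 1.2 to 1.4 follow from the axioms alone, so they should hold for any assignment of probabilities that satisfies the axioms, symmetric or not. The following is a minimal sketch on a five-point sample space; the weights, the point labels and the events A, B, C are hypothetical values chosen here only for illustration.

```python
from fractions import Fraction

# Hypothetical, deliberately unequal weights on a five-point sample space.
# Any non-negative weights adding up to 1 satisfy the three postulates.
weights = {
    "a": Fraction(1, 10),
    "b": Fraction(2, 10),
    "c": Fraction(3, 10),
    "d": Fraction(1, 10),
    "e": Fraction(3, 10),
}
S = set(weights)

def P(event):
    """Probability of an event as the sum of the weights of its points."""
    return sum((weights[point] for point in event), Fraction(0))

A, B, C = {"a", "b", "c"}, {"b", "c", "d"}, {"c", "d", "e"}

# Result 1.2: P(A^c) = 1 - P(A), with A^c computed as S - A.
result_1_2 = P(S - A) == 1 - P(A)

# Result 1.3: P(A ∪ B) = P(A) + P(B) - P(A ∩ B).
result_1_3 = P(A | B) == P(A) + P(B) - P(A & B)

# Result 1.4: the corresponding formula for three events.
result_1_4 = P(A | B | C) == (
    P(A) + P(B) + P(C)
    - P(A & B) - P(A & C) - P(B & C)
    + P(A & B & C)
)

print(result_1_2, result_1_3, result_1_4)
```

Exact fractions are used so that the equalities are checked without rounding error. A check on one example is of course not a proof.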


1.5 How to assign probabilities to individual events? 


Now, we introduce a number of rules so that with the help of the axiomatic definition 
and the following rules one may be able to compute the probabilities in a number of 
situations. 


Rule 1.1 (Symmetry in the outcomes). If the sample space consists of a finite num- 
ber of distinct elements and if the physical characteristics of the experiments are such 
that, with respect to all factors which may affect the possible outcomes, there is no 
way of preferring one outcome to the other then the rule says to assign equal proba- 
bilities to the elementary events. 


In order to apply this rule, one has to have a sample space consisting of a finite
number of elements, as in situations such as tossing a coin once or a number of times,
rolling a die a number of times, predicting successful completion of a job when there
are only a fixed number of alternatives, predicting rainfall, etc. The rule does not apply
to situations such as cutting a string, where the sample space consists of a continuum
of points; throwing a dart at a target, where the sample space consists of regions;
Buffon's clean tile problem, where the sample space consists of a room paved with tiles
or a finite planar region; and so on. The implication of Rule 1.1 is the following.


Rule 1.1a. When there is symmetry in the outcome of a random experiment and when 
there are k elementary events in the sample space S, k being finite, and if m of the 
sample points (elementary events) are favorable to an event A, then the probability 
of the event A will be taken as 


$$P(A) = \frac{\text{number of sample points favorable to } A}{\text{total number of sample points}} = \frac{m}{k}. \tag{1.7}$$


Let us take the example of tossing a coin twice. The sample space is
$S = \{(H,H), (H,T), (T,H), (T,T)\}$. If the physical characteristics of the coin are such that there
is no way of preferring one side to the other (in such a case we call the coin an unbiased
coin, or not loaded towards one side), and the throwing of the coin is such that there is
no advantage for one side over the other or, in short, with respect to all factors which
may affect the outcomes there is no advantage for one side over the other, then
we assign equal probabilities of $\frac{1}{4}$ to the individual elements in this sample
space. That is, we assign a probability of $\frac{1}{4}$ to the event of getting the sequence H first
and T next.

In some books, you may find the description saying when the "events are equally 
likely" they have equal probabilities. The statement is circumlocutory in the sense 
of using "probability" to define probability. Symmetry has nothing to do with the 
chances for the individual outcomes. We have the axioms defining probabilities and 
we have seen that the axioms are not sufficient to compute the probabilities in specific 
situations, and hence it is meaningless to say “equally likely events” when trying to 
compute the probabilities of events. Symmetry is concerned about the physical char- 
acteristics of the experiments and the factors affecting the outcomes and not about 
the chances of occurrence of the events. 

The phrases used to describe symmetry are “unbiased coin” in the case of coins,
“balanced die” in the case of rolling a die, and in other cases we say “when there is
symmetry in the experiment or symmetry in the outcomes”.

Thus, in the example of tossing a coin, if we ask:

What is the probability of getting a head when an unbiased coin is tossed once?
Then the answer is $\frac{1}{2}$. This value is assigned by us by taking into account the symmetry
in the experiment; it does not come from the axioms and is not deduced from somewhere.

What is the probability of getting exactly one head when an unbiased coin is
tossed twice?

Answer: Let A be the event of getting exactly one head. Let $A_1$ be the event of
getting the sequence (H, T) and $A_2$ be the event of getting the sequence (T, H). Then


$$A = A_1 \cup A_2 \quad\text{and}\quad A_1 \cap A_2 = \phi.$$




Therefore,

$$P(A) = P(A_1) + P(A_2)$$

by the third axiom. But we have assigned probabilities $\frac{1}{4}$ to individual outcomes
because of the additional assumption of symmetry, and hence $P(A_1) = \frac{1}{4}$ and $P(A_2) = \frac{1}{4}$.
Therefore,

$$P(A) = \frac{1}{4} + \frac{1}{4} = \frac{1}{2}.$$
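The computation above is an instance of Rule 1.1a and can be reproduced by listing the sample space. This is a minimal sketch, assuming symmetry in the outcomes; the helper name `symmetric_probability` is introduced here only for illustration.

```python
from fractions import Fraction
from itertools import product

def symmetric_probability(event, sample_space):
    """Rule 1.1a: favorable sample points over total sample points.

    Valid only under the additional assumption of symmetry, with a
    finite sample space.
    """
    favorable = [point for point in sample_space if point in event]
    return Fraction(len(favorable), len(sample_space))

# Sample space for tossing a coin twice: (H,H), (H,T), (T,H), (T,T).
S = list(product("HT", repeat=2))

# The event A of getting exactly one head, that is, {(H,T), (T,H)}.
exactly_one_head = {point for point in S if point.count("H") == 1}

p = symmetric_probability(exactly_one_head, S)
print(p)  # 1/2
```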


Example 1.6. An unbiased coin is tossed (a) three times and (b) four times. What are 
the probabilities of getting the sequences (i) HHT, (ii) THT in (a) and the sequences 
(iii) HHTT, (iv) HHHH or HTTT in (b)? 


Solution 1.6. In (a), the sample space consists of all possible sequences of H and T
and there are 8 such elementary events. They can be obtained by looking at the problem
of filling three positions by using H and T. The first position can be filled in two ways,
either H or T. For each such choice, the second position can be filled in two ways. For
each such choice for the first and second positions, the third can be filled in two ways,
so that the number of possible outcomes is 2 × 2 × 2 = 8. They are the following:


HHH, HHT, HTH, HTT, THH, THT, TTH, TTT. 


Since we assumed symmetry, all these 8 points are assigned probabilities $\frac{1}{8}$ each.
Hence the answers to (i) and (ii) are


$$P\{(HHT)\} = \frac{1}{8} \quad\text{and}\quad P\{(THT)\} = \frac{1}{8}.$$


When the coin is tossed four times, the sample space consists of 2 × 2 × 2 × 2 = 16
elementary events. Due to symmetry, we assign probabilities $\frac{1}{16}$ to each of these points.
Hence, in (iii), $P\{(HHTT)\} = \frac{1}{16}$ and, likewise,


$$P\{(HHHH)\} = \frac{1}{16}.$$


In (iv), the event of getting the sequences HHHH or HTTT means the union of two
mutually exclusive events and by the third axiom, the probability is the sum of the
probabilities. We have assigned probabilities $\frac{1}{16}$ each, and hence

$$P\{(HHHH \text{ or } HTTT)\} = P\{(HHHH)\} + P\{(HTTT)\} = \frac{1}{16} + \frac{1}{16} = \frac{1}{8}.$$

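The counting in Solution 1.6 can be reproduced by generating the sequences. The sketch below assumes an unbiased coin, so that every sequence of a given length gets the same probability; the function name `sequence_probability` is hypothetical.

```python
from fractions import Fraction
from itertools import product

def sequence_probability(sequences, n_tosses):
    """Probability that one of the given sequences occurs in n_tosses tosses.

    Assumes symmetry, so each of the 2**n_tosses sequences is assigned
    the same probability and distinct sequences are mutually exclusive.
    """
    sample_space = {"".join(toss) for toss in product("HT", repeat=n_tosses)}
    favorable = set(sequences) & sample_space
    return Fraction(len(favorable), len(sample_space))

p_hht = sequence_probability(["HHT"], 3)             # part (i)
p_tht = sequence_probability(["THT"], 3)             # part (ii)
p_hhtt = sequence_probability(["HHTT"], 4)           # part (iii)
p_union = sequence_probability(["HHHH", "HTTT"], 4)  # part (iv)

print(p_hht, p_tht, p_hhtt, p_union)
```
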
Example 1.7. A balanced die is rolled two times. What is the probability of (i) rolling 
9 and (ii) getting a sum greater than or equal to 10? 


Solution 1.7. When we say “balanced”, it means that we are assuming symmetry
in the experiment and we are assigning equal probabilities to all elementary
events. Here, there are 36 elementary events and each point will get a probability of
$\frac{1}{36}$. (i) Rolling 9 means that the sum of the face numbers is 9. The possible points
are (3,6), (4,5), (5,4), (6,3). These are mutually exclusive because, for example, when
the sequence (3,6) comes, another sequence cannot come at the same time. Let A be
the event of rolling 9, and let $A_1$ to $A_4$ denote the events of getting the sequences
(3,6), ..., (6,3), respectively. Then


$$A = A_1 \cup A_2 \cup A_3 \cup A_4$$

and

$$A_1 \cap A_2 = \phi, \quad A_1 \cap A_3 = \phi, \quad A_1 \cap A_4 = \phi,$$
$$A_2 \cap A_3 = \phi, \quad A_2 \cap A_4 = \phi, \quad A_3 \cap A_4 = \phi.$$


That is, they are all mutually exclusive. Hence by the third axiom in the definition of
probability


$$P(A) = P(A_1) + P(A_2) + P(A_3) + P(A_4).$$


But we have assigned equal probabilities to elementary events. Hence


$$P(A) = \frac{1}{36} + \frac{1}{36} + \frac{1}{36} + \frac{1}{36} = \frac{4}{36} = \frac{1}{9}.$$


Similarly, let B be the event of getting a sum greater than or equal to 10, which means
10 or 11 or 12. The points favorable to this event are (4,6), (5,5), (6,4), (5,6), (6,5), (6,6),
that is, 6 points are favorable to the event B, and since symmetry is assumed,


$$P(B) = \frac{6}{36} = \frac{1}{6}.$$
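The two answers in Solution 1.7 can be confirmed by listing the 36 elementary events. This is a small sketch under the symmetry assumption of a balanced die; the helper `P`, which takes a condition on a sample point, is a convenience introduced here.

```python
from fractions import Fraction
from itertools import product

# The 36 elementary events when a die is rolled twice.
S = list(product(range(1, 7), repeat=2))

def P(condition):
    """Probability of the event defined by a condition on a sample point,
    assuming a balanced die (equal probability 1/36 for each point)."""
    favorable = [point for point in S if condition(point)]
    return Fraction(len(favorable), len(S))

p_nine = P(lambda point: sum(point) == 9)          # part (i)
p_ten_or_more = P(lambda point: sum(point) >= 10)  # part (ii)

print(p_nine, p_ten_or_more)
```
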


Rule 1.2. Assign probability 0 for almost surely impossible events and probability 1
for almost surely sure events.


By assigning probability 0, we are not saying that the corresponding event is
logically impossible. If an event is logically impossible, then its probability is zero as a
consequence of the axioms defining probability. When we assign 1 to an almost surely
sure event, we are not saying that the event is a sure event. For a logically sure event,
the probability is 1 by the second axiom defining probability. But an assigned
probability 1 does not mean that the event is a sure event.


Rule 1.3. If the sample space consists of a continuum of points giving a line segment
(or segments) of finite length (or lengths), such as a piece of string of length 50 cm,
and if the experiment is to take a point from this line segment (or segments), such as
a cut on this string, and if there is no preference of any sort in selecting this point, then
assign probabilities proportional to the lengths, taking the total length as unity.


When a point is selected from a line segment of finite length by using the rule of
assigning probabilities proportional to the lengths, we use the phrase: a point
is selected at random from the line segment, or we have a “random point” from this
line segment. If a string is cut by using the above rule, we say that we have a random
cut of the string, or we say that the point of cut is uniformly distributed over the line
segment. These are all standard phrases used in this situation. Then, if an event A is
that the random point lies on a segment of length m units out of a total length of n
units, n > m, then the rule says:


$$P(A) = \frac{m}{n}. \tag{1.8}$$


Example 1.8. A random cut is made on a string of 30 cm in length. Marking one end 
of the string as zero and the other end as 30, what is the probability that (i) the cut 
is between 10 and 11.7, (ii) the cut is between 10 and 10.001, (iii) the cut is at 10, and 
(iv) the smaller piece is less than or equal to 10 cm? 


Solution 1.8. Since we use the phrase “random cut”, we are assigning probabilities
proportional to the lengths. Let x be the distance from the end marked 0 to the point
of cut. Let A be the event $A = \{x \mid 10 \le x \le 11.7\}$ [this notation means: all values of
x such that x is between 10 and 11.7, both end points included], let B be the event
$B = \{x \mid 10 \le x \le 10.001\}$, let C be the event $C = \{x \mid x = 10\}$ and let D be the
event that the smaller piece is less than or equal to 10 cm. The length of the interval in
A is 11.7 − 10.0 = 1.7. Since we are assigning probabilities proportional to the lengths,
we have


$$P(A) = \frac{11.7 - 10.0}{30} = \frac{1.7}{30} = \frac{17}{300},$$

$$P(B) = \frac{10.001 - 10.000}{30} = \frac{0.001}{30} = \frac{1}{30000}$$

and

$$P(C) = \frac{10 - 10}{30} = \frac{0}{30} = 0.$$


Since we are assigning probabilities proportional to the lengths and since a point does
not have any length by definition, according to this rule the probability that the
cut is at a specific point, in a continuum of points, is zero. By assigning the value zero
to this probability, we are not saying that it is impossible to cut the string at that point;
the value zero is only a consequence of the rule of assigning probabilities proportional
to lengths. For the event D, the smaller piece is of length less than or equal to 10 cm
in the following two situations:


$$D_1 = \{x \mid 0 \le x \le 10\} \quad\text{and}\quad D_2 = \{x \mid 20 \le x \le 30\}.$$


Therefore,

$$D = D_1 \cup D_2 \quad\text{where}\quad D_1 \cap D_2 = \phi.$$

Hence

$$P(D) = P(D_1) + P(D_2) = \frac{10 - 0}{30} + \frac{30 - 20}{30} = \frac{10}{30} + \frac{10}{30} = \frac{20}{30} = \frac{2}{3}.$$

Note that this variable x can be said to be uniformly distributed over the line segment
[0, 30] in this example.
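The four probabilities of Example 1.8 follow mechanically from Rule 1.3. The following is a minimal sketch; the function name `length_probability` is an assumption made here, and decimal end points are passed as strings so that the fractions are exact.

```python
from fractions import Fraction

def length_probability(intervals, total_length):
    """Rule 1.3: total length of disjoint intervals over the total length.

    `intervals` is a list of (start, end) pairs that must not overlap;
    the rule applies only to a random cut on a string of finite length.
    """
    favorable = sum(
        (Fraction(end) - Fraction(start) for start, end in intervals),
        Fraction(0),
    )
    return favorable / Fraction(total_length)

p_A = length_probability([("10", "11.7")], 30)     # cut between 10 and 11.7
p_B = length_probability([("10", "10.001")], 30)   # cut between 10 and 10.001
p_C = length_probability([(10, 10)], 30)           # cut exactly at 10
p_D = length_probability([(0, 10), (20, 30)], 30)  # smaller piece at most 10 cm

print(p_A, p_B, p_C, p_D)
```

The event C is represented as an interval of zero length, which is how the rule ends up assigning it probability zero.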


Note 1.4. The above rule cannot be applied if the string is of infinite length such as a 
beam of light or laser beam or sound wave, etc. How do we compute probabilities in 
such situations of strings of infinite length? 


Rule 1.4. When a point is selected at random from a planar region of finite area,
assign probabilities proportional to the area, and when a point is selected at random
from a higher dimensional space of finite hyper-volume, assign probabilities
proportional to the volume. According to this rule, if the total area is $\alpha$ and if, out of this,
$\mu(\alpha)$ of the area is favorable to an event A, then the probability of A is assumed as


$$P(A) = \frac{\mu(\alpha)}{\alpha} \tag{1.9}$$


where $\alpha$ is the Greek letter alpha and $\mu$ is the Greek letter mu. Similarly, if $v$ is the total
volume (or hyper-volume) of the space under consideration and if the portion $\mu(v)$ of
$v$ is favorable to an event A then, as per the above rule, the probability of A is taken as


$$P(A) = \frac{\mu(v)}{v}. \tag{1.10}$$


Several items here need explanation. When a point is taken at random from a
planar region of finite area $\alpha$, such as the point of hit of an arrow when the arrow is shot
onto a wall of length 10 meters and width 2 meters (area = $\alpha$ = 10 × 2 = 20 sq meters),
“at random” means that there is no preference of any sort for the point to
be found anywhere on the planar region. Then we assign probability $\frac{\alpha_1}{\alpha}$ to every
possible subregion of area $\alpha_1$, with a similar interpretation for higher dimensional
situations. The standard terminology for length, area, volume, etc. is the following: length
(one-dimensional), area (two-dimensional), volume (three-dimensional), hyper-volume
(four or higher dimensional). For simplicity, we say “volume” for three or higher dimensional
cases, instead of saying “hyper-volume”.


Example 1.9. A person trying dart throwing for the first time throws a dart at random 
to a circular board of a radius of 2 meters. Assuming that the dart hits the board, what 
is the probability that (1) it hits within the central region of radius 1 meter; (2) it hits 
along a horizontal line passing through the center and (3) it hits exactly at the center 
of the board as shown in Figure 1.5? 


Solution 1.9. Assuming that the point of the hit is a random point on the board, we
may assign probabilities proportional to the area. The total area of the board is the
area of a circle with a radius of 2 meters:


$$\text{Total area} = \pi r^2 = \pi (2)^2 = 4\pi \text{ m}^2$$


where the standard notation m² means square meters. (1) The area of the central region
of radius one meter = $\pi (1)^2 = \pi$ m². Hence the required probability, denoted by
P(A), is

$$P(A) = \frac{\pi}{4\pi} = \frac{1}{4}.$$


Figure 1.5: Circular board and circular, line, point targets.


For answering (2), we have to look at the area along a line passing through the
center. But, by definition, a line has no area, and hence the area here is zero. Thus the
required probability is $\frac{0}{4\pi} = 0$. In (3) also, a point has no area by definition, and hence
the probability is zero.
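The value 1/4 in part (1) can be compared with a simulation in which random points on the board are generated. This is only a sketch: the function names, the seed, the number of throws and the rejection method used to generate points uniformly over the disc are all choices made here for illustration.

```python
import math
import random

def area_probability(favorable_area, total_area):
    """Rule 1.4: favorable area over total area."""
    return favorable_area / total_area

def simulate_inner_hit(n_throws, seed=0):
    """Fraction of random hits on a board of radius 2 that land within
    the central region of radius 1."""
    rng = random.Random(seed)
    hits = inner = 0
    while hits < n_throws:
        # Take a random point in the enclosing square and keep it only
        # if it falls on the board (rejection sampling).
        x, y = rng.uniform(-2, 2), rng.uniform(-2, 2)
        if x * x + y * y <= 4:
            hits += 1
            if x * x + y * y <= 1:
                inner += 1
    return inner / n_throws

total = math.pi * 2 ** 2
p_inner = area_probability(math.pi * 1 ** 2, total)  # part (1)
p_line = area_probability(0.0, total)                # part (2): a line has no area
estimate = simulate_inner_hit(100_000)

print(p_inner, p_line, estimate)
```

The simulated fraction should be close to, but not exactly equal to, the assigned probability 1/4.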


Note 1.5. Note that the numerator and denominator here are in terms of square me- 
ters, but probability is a pure number and has no unit of measurement or does not 
depend on any unit of measurement. 


Note 1.6. Also when assigning probabilities proportional to the area, remember that 
lines and points have no areas, and a point has no length or area but a line has length 
but no area. Similarly, when assigning probabilities proportional to the volume, re- 
member that a planar region has no volume but it has area, and a line has no volume 
or area but has length, and a point has no length, area or volume. 




1.5.1 Buffon's "clean tile problem" 


Example 1.10. Solve Buffon's clean tile problem. That is, a circular coin of
diameter d is thrown upward. When it falls on the floor paved with identical square tiles of
length m with d < m, what is the probability that the coin will fall clean, which means
that the coin will not cut any of the edges and corners of the tiles?


Solution 1.10. In Figure 1.6, a typical square tile is marked. Since the coin is tossed
upward, we assume that the center of the coin could be anywhere on the tile if the
coin has fallen on that tile. In other words, we are assuming that the center of the coin
is a random point on the square tile or uniformly distributed over that square tile. In
Figure 1.6, an inner square is drawn at a distance of $\frac{d}{2}$ from the boundaries of the outer
square. If the center of the coin is anywhere on the boundaries of the inner square or
in the region between the walls of the two squares, then the coin can touch or cut the
walls of the outer square.


Figure 1.6: Square tile and circular coin.


If the center of the coin is strictly within the inner square, then the coin will fall clean.
Therefore, the probability of the event A = the event that the coin falls clean, is given
by


$$P(A) = \frac{\text{area of the inner square}}{\text{area of the outer square}} = \frac{(m - \frac{d}{2} - \frac{d}{2})^2}{m^2} = \frac{(m - d)^2}{m^2}. \tag{1.11}$$

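Equation (1.11) can be compared with a simulation in which the center of the coin is taken as a random point on the tile. The sketch below is only an illustration; the function names, the tile side m = 4, the diameter d = 1, the seed and the number of throws are arbitrary choices made here.

```python
import random

def clean_tile_probability(m, d):
    """Equation (1.11): (m - d)^2 / m^2 for a square tile of side m
    and a circular coin of diameter d, with d < m."""
    if not 0 < d < m:
        raise ValueError("the diameter d must satisfy 0 < d < m")
    return (m - d) ** 2 / m ** 2

def simulate_clean_tile(m, d, n_throws=100_000, seed=0):
    """Fraction of throws in which the coin falls clean, taking the center
    of the coin as uniformly distributed over the square tile."""
    rng = random.Random(seed)
    r = d / 2
    clean = 0
    for _ in range(n_throws):
        x, y = rng.uniform(0, m), rng.uniform(0, m)
        # The coin falls clean when its center is strictly within the
        # inner square, at distance more than d/2 from every edge.
        if r < x < m - r and r < y < m - r:
            clean += 1
    return clean / n_throws

exact = clean_tile_probability(4, 1)
estimate = simulate_clean_tile(4, 1)

print(exact, estimate)
```

With these values the formula gives 9/16 = 0.5625, and the simulated fraction should be close to it.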
This problem is generalized by looking at a floor paved with rectangular tiles of length
m units and width n units, and a circular coin of diameter d units where d < m, d < n. This
problem can be done in a similar way by looking at the center of the coin and assuming
that the center is uniformly distributed over the rectangle. The floor can be paved with
any symmetrical object such as a rhombus or general polygon, and a circular coin is
tossed. The problem is to compute the probability that the coin will fall clean.

A three-dimensional generalization of the problem is to consider a prism with a
square, rectangular, parallelogram or general polygonal base, with a ball or sphere of
radius r randomly placed inside the prism. Some illustrations are given in Figure 1.7.
What is the probability that the ball will not touch any of the sides, base or top of the
prism? When we move from a one-dimensional case to two or higher dimensions, then
more axioms such as “invariance” are needed to define probability measures.




Figure 1.7: Tiles of various shapes.


Another basic problem that Buffon had looked into is called Buffon's needle problem.
A floor is paved with parallel lines, m units apart. A headless needle (or a line segment)
of length d is tossed up. What is the probability that the needle will touch or cut any
of the parallel lines when the needle falls to the floor? There are several situations of
interest here. One is the case of a short needle where the length of the needle, d, is less
than m. Another case is when m < d < 2m. Another case is a long needle which
can cut a number of parallel lines. Remember that however long the needle may be,
there is a possibility that the needle need not cut any of the lines; for example, the
needle can fall parallel to the lines.


Figure 1.8: Buffon's needle problem. 


Another needle problem is when the floor has horizontal and vertical lines making
rectangular grids of length m units and width n units and a needle of length d is tossed
as shown in Figure 1.8. A generalization of this problem is the case when the needle
can be of any shape, and need not be straight.

For dealing with Buffon's needle problem, we need the concepts of random vari- 
ables and independence of random variables. Hence we will not do examples here. 

When we combine geometry with probability, many interesting paradoxes can 
arise. The subject dealing with the combination of geometry and probability is called 
Stochastic Geometry. Students who are interested in this area can look into the book [A] 
and other papers of A. M. Mathai. 


Exercises 1.5 


1.5.1. An unbiased coin is tossed until a head is obtained. What is the probability that
the experiment is finished in (i) 4 or fewer trials, (ii) 20 or fewer trials?


1.5.2. A balanced die is rolled 3 times. What is the probability of getting:
(a) a sum greater than 14;
(b) all the face numbers the same;
(c) at least two of the face numbers the same;
(d) the sequences 666 or 121 or 112?


1.5.3. An unbiased coin is flipped 3 times. What is the probability of getting (1) exactly 
one head or (2) at least one head? 


1.5.4. A box contains 6 identical chips numbered 1,2,3,4,5,6. Two chips are taken 
one-by-one at random (blind-folded after shuffling well) with replacement. What is 
the probability that (1) the first number is bigger than the second number? (2) the first 
number is less than 5 of the second number? 


1.5.5. What are the probabilities in Exercise 1.5.4 if the sampling is done without re- 
placement? 


1.5.6. In Exercise 1.5.4, what is the probability that (1) the number in the second trial 
is bigger than the number in the first trial and (2) the number in the second trial is 
bigger than that in the first trial, given that the first trial resulted in the number 1? 


1.5.7. A box contains 7 identical marbles except for the color, 4 are red and 3 are green. 
Two marbles are picked at random one by one without replacement. What is the prob- 
ability of getting: 

(a) the sequence RG (red green); 

(b) exactly one red and one green; 

(c) RR (red red); 

(d) exactly 2 red marbles? 


1.5.8. In Exercise 1.5.7 suppose a subset of 2 marbles is taken at random or blind- 
folded by putting the hand in the box and taking 2 together. Answer (a), (b), (c), (d). 


1.5.9. Two identical pieces of string of 20 cm are there. One end of each is marked zero 
and the other end 20. One string is cut at random. Let x be the distance from zero to 
the point of cut. The second string is cut at random. Let y be the distance from zero to 
the point of cut. Find the probability that: 

(i) $x < y$, (ii) $x \le y$, (iii) $x + y < 10$, (iv) $x + y \ge 30$,

(v) $10 < x < 15$, (vi) $5 < y < 20$, (vii) $5 < x < 10$ and $10 < y < 20$,

(viii) $x^2 + y^2 < 10$, (ix) $x^2 + y^2 = 10$.


1.5.10. A floor is paved with identical square tiles of side 10 cm. A circular coin with a
diameter of 2 cm is tossed up. What is the probability that:

(a) the coin will fall clean; 

(b) the coin will not fall clean; 

(c) the coin will cut exactly one of the edges of the tiles? 


1.5.11. In Exercise 1.5.10, if the coin is flipped twice, what is the probability that: 


32 —— 1 Random phenomena 


(a) on both occasions the coin will fall clean; 
(b) in exactly one occasion it falls clean; 
(c) on the first occasion it falls clean and on the second occasion it does not fall clean?


1.5.12. In Exercise 1.5.10, suppose that the sides of the tiles are m units each and the 
diameter of the coin is d units. What should the connection be between m and d so that 
the game is fair, which means the probability of the coin falling clean is the same as 
the probability it does not fall clean (in such a case, in a game of chance, both people 
betting on each of the two events of falling clean and not falling clean will have the 
same chance of winning at each trial). 


1.5.13. Suppose that the floor is paved with identical rectangular tiles with lengths of 
10 cm and a width of 5 cm and a coin with a diameter of 4cm is tossed. What is the 
probability that the coin will fall clean? 


1.5.14. Suppose that a floor is paved with identical rhombuses of side m units and a 
circular coin of diameter d is tossed. What is the probability that the coin will fall clean 
if d is small such that it can fall clean? 


1.5.15. In Exercise 1.5.13, if the floor is paved with identical equilateral triangles, then 
what will be the corresponding probability? 


1.5.16. Answer the questions (a) and (b) in Exercise 1.5.10 if the tiles are (1) equilateral 
triangles with sides of 20 cm each, (2) parallelograms with sides of equal length of 
20 cm and (3) hexagons with sides of 20 cm each. 


2 Probability 


2.1 Introduction 


In Chapter 1, we have introduced the basic notion of probability. In the present chap- 
ter, we will explore more properties of probability, the idea of conditional probability, 
basic notions of independence of events, pair-wise independence, mutual indepen- 
dence, Bayes' theorem, etc. For the computations of probabilities in given situations, 
we will need some ideas of permutations and combinations. Students may be familiar
with these topics, but for the sake of those who are not, or who have forgotten them, a brief
description is given here in Sections 2.2 and 2.3. In Section 2.4, a note on sigma and pi
notations is given. Those who already know these materials may skip these sections 
and go directly to Section 2.5. 


2.2 Permutations 


To permute means to rearrange and the number of permutations means the number 
of such rearrangements. We shall look into the problem of filling up some positions 
with some objects. For example, let there be r seats in a row and n individuals to be
seated on these r seats. In how many different ways can we select individuals from this
set of n individuals to occupy these r seats? For example, suppose that there are r = 2
seats and n = 5 individuals. The first seat can be given to one of the 5 individuals, and
hence there are five choices of filling up the first seat. When the first seat is already 
filled, there are 4 individuals left and one seat is left. Hence the second seat can be 
filled with one of the four remaining individuals or in 4 different ways. For each of the 
five choices for the first seat, there are four choices for the second seat. Hence the total 
number of choices for filling up these two seats is 5 x 4 = 20 ways. If A, B, C, D, E denote 
the five individuals and if the first seat is given to A then the sequences possible for 
the two seats are AB, AC, AD, AE. If B is given the first seat again, there are four such 
choices, and so on. We can state this as the total number of permutations of five, taken 
two at a time, and it is 20. We can also state this as the total number of ordered sets 
of two from a set of five or the total number of sequences of two distinct items taken
from a set of five items.

Thus, if there are n individuals and r seats to be filled, $r \le n$, then the total number
of choices for filling up these r seats with n individuals is $n(n-1)(n-2)\cdots(n-(r-1))$.
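
The seat-filling argument above can be checked by listing the ordered pairs directly. The following is a minimal sketch in Python; the function name `count_permutations` and the list `individuals` are our own illustrative choices, not notation from the text.

```python
from itertools import permutations

def count_permutations(n, r):
    """Return n(n-1)...(n-r+1), the number of permutations of n taken r at a time."""
    total = 1
    for k in range(r):
        total *= n - k
    return total

individuals = ["A", "B", "C", "D", "E"]
# Every ordered way of giving the two seats to two different individuals.
seatings = ["".join(pair) for pair in permutations(individuals, 2)]

print(seatings[:4])       # the seatings in which A takes the first seat
print(len(seatings), count_permutations(5, 2))
```

Both counts should agree with the value 5 × 4 = 20 obtained above.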


Notation 2.1. $P(n,r) = {}_nP_r$ = total number of permutations of n, taken r at a time.


Definition 2.1 (Permutations). The total number of permutations of n distinct ob- 
jects, taken r at a time or the total number of ordered sets of r distinct items from 


Open Access. © 2018 Arak M. Mathai, Hans J. Haubold, published by De Gruyter. This work is licensed
under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
https://doi.org/10.1515/9783110562545-002




the set of n distinct items or the total number of sequences of r items from a set of 
n distinct items is given by 


$$P(n,r) = n(n-1)(n-2)\cdots(n-(r-1)) = n(n-1)\cdots(n-r+1). \tag{2.1}$$


For example, the total number of permutations of 5 items, taken 3 at a time is
5 × 4 × 3 = 60. The total number of permutations of 5, taken 5 at a time or all is
5 × 4 × 3 × 2 × 1 = 120:


$$P(1,1) = 1, \quad P(2,1) = 2, \quad P(n,1) = n, \quad P(10,2) = (10)(9) = 90,$$
$$P(4,4) = (4)(3)(2)(1) = 24 = 4!, \quad P(-3,2) = \text{no meaning},$$


$$P\left(2, \tfrac{1}{2}\right) = \text{no meaning}.$$
Notation 2.2. n! = factorial n or n factorial. 


Definition 2.2. 
$$n! = (1)(2)\cdots(n), \quad 0! = 1 \text{ (convention)}.$$


That is,
$$2! = (1)(2) = 2, \quad 3! = (1)(2)(3) = 6, \quad 4! = (1)(2)(3)(4) = 24,$$
$$(-2)! = \text{not defined}, \quad \left(\tfrac{1}{2}\right)! = \text{not defined},$$


and so on. We can also write the number of permutations in terms of factorials: 


$$P(n,r) = n(n-1)\cdots(n-r+1) = \frac{n(n-1)\cdots(n-r+1)(n-r)\cdots(2)(1)}{(n-r)(n-r-1)\cdots(2)(1)}$$

by multiplying and dividing by $(n-r)(n-r-1)\cdots(2)(1)$. That is,

$$P(n,r) = \frac{n!}{(n-r)!}. \tag{2.2}$$


If we want this formula, in terms of factorials, to hold for all n then let us see what 
happens if we compute P(n, n). Writing in terms of factorials, by substituting r = n on 
the right side of the above equation (2.2), we have 
$$P(n,n) = \frac{n!}{0!}.$$
But, from the original definition,
$$P(n,n) = n(n-1)\cdots(2)(1) = n!.$$


Hence we need the convention 0! = 1 if we want to use the representation of P(n,r) in
terms of factorials for all r and n. [Mathematical conventions are convenient assumptions
which will not contradict or interfere with any of the mathematical derivations
or computations. Also note that all computations can be carried out without this
convention. If we do not want to use the convention, then at equation (2.2) write
$r = 1, \ldots, n-1$ and $P(n,n) = n!$.]


As a simple example, we can consider words in a text. A word made of distinct letters
is an ordered set or ordered sequence of those letters. If the letters are rearranged
or permuted, then we obtain different words. House numbers in a city, postal
codes in addresses, etc. are all ordered sets of numbers, and if the numbers are
permuted we get other house numbers, other postal codes, etc.


Example 2.1. (1) How many different 3-letter words can be made by using all of the
letters in the word "can"; (2) how many different 4-letter words can be made by using
all of the letters of the word "good"; (3) how many 11-letter words can be made by
using all of the letters in the word "Mississippi"?


Solution 2.1. (1) The different words are the following: 
can, cna, anc, acn, nac, nca. 


There are 6 = 3! such words. In (2), we have the letter "o" repeated 2 times. If the
o's were different such as $o_1, o_2$, then the total number of words possible is 4! = 24. But
$o_1o_2$ or $o_2o_1$ will give the same oo. Note that $o_1, o_2$ can be permuted in 2! ways and all
these permutations will produce the same word. Hence the total number of distinct
words possible is
$$\frac{4!}{2!} = 12.$$
In (3), the letter "s" is repeated 4 times, "i" is repeated 4 times and "p" is repeated
2 times. Hence the total number of distinct words possible is
$$\frac{11!}{4!\,4!\,2!} = 34650.$$

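
The counting rule used in Solution 2.1 (divide n! by the factorial of each letter's repeat count) can be sketched as follows. The helper name `distinct_words` is hypothetical, and the brute-force check is done only for the 4-letter word, since listing all arrangements of an 11-letter word is impractical.

```python
from collections import Counter
from itertools import permutations
from math import factorial

def distinct_words(word):
    """Number of distinct arrangements of all letters of `word`:
    n! divided by the factorial of each letter's repeat count."""
    count = factorial(len(word))
    for repeats in Counter(word).values():
        count //= factorial(repeats)   # each division is exact
    return count

# Brute-force check: list all 4! arrangements and discard duplicates.
brute_good = len(set(permutations("good")))

print(distinct_words("can"), distinct_words("good"), brute_good)
print(distinct_words("mississippi"))
```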
Example 2.2. How many different number plates can be made containing only three 
digits if (1) repetition of numbers is allowed, (2) no repetition is allowed. 


Solution 2.2. A number plate with 3 digits means filling up 3 positions with one of the
numbers 0, 1, ..., 9. The first position can be filled in 10 different ways with one of the
10 numbers 0, 1, ..., 9. When repetition is allowed, the second and third positions can
also be filled in 10 different ways. Thus the total number of number plates possible is


$$10 \times 10 \times 10 = 10^3 = 1000 \quad \text{when repetition is allowed}.$$


When repetition is not allowed, then the first position can be filled in 10 ways, the
second only in 9 ways and the third position in only 8 ways. Thus the total number of
number plates possible is

$$10 \times 9 \times 8 = 720 \quad \text{when repetition is not allowed}.$$
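
As a quick check of Solution 2.2, both kinds of plates can be listed exhaustively. This is only an illustrative sketch using the Python standard library; the variable names are ours.

```python
from itertools import permutations, product

digits = range(10)

# Repetition allowed: every ordered triple of digits.
with_repetition = list(product(digits, repeat=3))
# No repetition: ordered triples of three different digits.
without_repetition = list(permutations(digits, 3))

print(len(with_repetition), len(without_repetition))
```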


We can also write the number of permutations by using the Pochhammer symbol, 
which is widely used in mathematical analysis. 


Notation 2.3. $(\alpha)_k$: Pochhammer symbol, where $\alpha$ is the Greek letter alpha.


Definition 2.3.
$$(\alpha)_k = \alpha(\alpha+1)(\alpha+2)\cdots(\alpha+k-1), \quad \alpha \neq 0, \quad (\alpha)_0 = 1. \tag{2.3}$$


For example, 


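$$(1)_n = (1)(2)\cdots(1+n-1) = n!; \quad (2)_3 = (2)(3)(4) = 24;$$

$$(-2)_3 = (-2)(-2+1)(-2+2) = 0; \quad \left(\tfrac{1}{2}\right)_2 = \left(\tfrac{1}{2}\right)\left(\tfrac{1}{2}+1\right) = \tfrac{3}{4};$$

$$(0)_2 = \text{not defined}; \quad (3)_0 = 1; \quad (5)_{-2} = \text{not defined}.$$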


Note that the various factors in the Pochhammer symbol are in ascending order in
the form $a(a+1)(a+2)\cdots$. Suppose we have factors in descending order such as
$b(b-1)(b-2)\cdots$; can we also write this as a Pochhammer symbol? The answer is
in the affirmative. Consider the following:


$$b(b-1)\cdots(b-k+1) = (-1)^k(-b)(-b+1)\cdots(-b+k-1) = (-1)^k(-b)_k. \tag{2.4}$$
With the help of (2.4), we can write the number of permutations in terms of a Pochhammer
symbol. The total number of permutations of n, taken r at a time, is given by P(n, r)
where
$$P(n,r) = n(n-1)(n-2)\cdots(n-r+1) = (-1)^r(-n)(-n+1)\cdots(-n+r-1) = (-1)^r(-n)_r. \tag{2.5}$$
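
Equation (2.5) is easy to test numerically. The sketch below assumes a non-negative integer k and does not enforce the restriction on the first argument stated in Definition 2.3; the function names `pochhammer` and `permutations_via_pochhammer` are ours.

```python
def pochhammer(a, k):
    """(a)_k = a(a+1)...(a+k-1) for an integer k >= 0, with (a)_0 = 1."""
    if k < 0:
        raise ValueError("k must be a non-negative integer")
    result = 1
    for j in range(k):
        result *= a + j
    return result

def permutations_via_pochhammer(n, r):
    """P(n, r) written as (-1)^r (-n)_r, as in equation (2.5)."""
    return (-1) ** r * pochhammer(-n, r)

print(pochhammer(2, 3), pochhammer(-2, 3), pochhammer(3, 0))
print(permutations_via_pochhammer(5, 3), permutations_via_pochhammer(10, 2))
```

The second line should reproduce the values 5 × 4 × 3 = 60 and (10)(9) = 90 obtained from the basic definition.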


Exercises 2.2 


2.2.1. Evaluate the following numbers of permutations, if possible: (1) P(4,2); 
(2) PG. 4); (3) P(-5,2); (4) PG, 2; (5) PĠ, 2). 


2.2.2. If there are 20 students in a class, then their birthdays could be any one of the 
365 days 1,2, ...,365. If no two birthdays are the same or if all students have distinct 
birthdays, then how many possibilities are there? 




2.2.3. How many 3-letter words can be made by using the alphabets of the word, 
(1) mind; (2) big, with (a) no letter is repeated, (b) the letter i is present. 


2.2.4. How many 3-digit number plates can be made (1) with no restriction; (2) no
numbers should be repeated; (3) one number 5 must be present; (4) the plate should 
start with a number 5. 


2.2.5. In how many ways can 10 persons be seated (a) on a straight line of 4 chairs;
(b) at a circular table with 4 chairs?


2.2.6. Evaluate the following Pochhammer symbols: (1) (-5)2; (2) (-5)5; (3) (2; 
(4) (4. 


2.2.7. Convert the following number of permutations into Pochhammer notation: 
(1) P(5, 3); (2) P(10,2); (3) P(5, 0); (4) PG, 5). 


2.2.8. From a box containing 3 red and 5 green identical marbles, three marbles are 
picked at random (i) with replacement; (ii) without replacement. How many sample 
points are there in the sample space? 


2.2.9. In Exercise 2.2.8, if we are interested in the event of getting exactly 2 red and 
one green marble, then how many sample points are there favorable to this event? 


2.2.10. A coin is tossed 3 times. Write down all possible sequences of head H and 
tails T. 


2.3 Combinations 


In permutations, we were interested in the rearrangement or in sequences or in or- 
dered sets or ordered subsets from the given set of objects. Suppose that we are not 
interested in the order but only in the subsets. For example, if we have 3 letters a, b,c 
and if we are looking at the ordered subsets of two letters from these three letters then 
the ordered sequences are 


ab, ac, ba, bc, ca, cb 


or there are 3 x 2 = 6 such ordered sets. Suppose that we are only concerned with the 
subsets of two letters from this set of three letters then the subsets are 


{a,b}, {a,c}, {b,c}


because whether the sequence ab or ba appears it is the same subset of the letters a 
and b. 

How many subsets of r elements are possible from a set of n distinct elements? If 
a subset of r elements is there, then we can order them in r! ways to get all ordered 
sequences. 




Hence the total number of subsets of r elements from a set of n elements = total
number of permutations of n taken r at a time, divided by r!:


$$\frac{P(n,r)}{r!} = \frac{n(n-1)\cdots(n-r+1)}{r!} = \frac{n!}{r!\,(n-r)!}. \tag{2.6}$$


This is known as the number of combinations of n taken r at a time. The standard
notations used are $\binom{n}{r}$, ${}_nC_r$, $C(n,r)$. We will use the notation $\binom{n}{r}$.


Notation 2.4. $\binom{n}{r}$ = the number of combinations of n, taken r at a time = the number
of subsets of r distinct elements from the set of n distinct elements.


Definition 2.4. The number of combinations of n, taken r at a time or the number of
possible subsets of r distinct elements from a set of n distinct elements, is given by


$$\binom{n}{r} = \frac{P(n,r)}{r!} = \frac{n(n-1)\cdots(n-r+1)}{r!} = \frac{n!}{r!\,(n-r)!} \tag{2.7}$$
$$= \frac{(-1)^r(-n)_r}{r!} \quad \text{(in terms of Pochhammer symbol)}. \tag{2.8}$$


From this definition itself, the following properties are evident by substituting for r: 


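$$\binom{n}{n} = \binom{n}{0} = 1; \quad \binom{n}{1} = \binom{n}{n-1} = n;$$
$$\binom{n}{2} = \binom{n}{n-2} = \frac{n(n-1)}{2!}; \quad \binom{-2}{2} = \text{not defined}; \quad \binom{1/2}{n} = \text{not defined}.$$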


From the representation in terms of factorials, we have the following results for all r: 


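$$\binom{n}{r} = \binom{n}{n-r}, \quad r = 0, 1, 2, \ldots, n \tag{2.9}$$
$$\binom{n}{r} = \binom{n-1}{r} + \binom{n-1}{r-1} \tag{2.10}$$
$$\binom{n}{0} = \binom{n}{n}, \quad \binom{n}{1} = \binom{n}{n-1}, \quad \binom{n}{2} = \binom{n}{n-2}, \quad \text{and so on}.$$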

For example, 
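$$\binom{100}{98} = \binom{100}{100-98} = \binom{100}{2} = \frac{(100)(99)}{2!} = 4950.$$
$$\binom{210}{200} = \binom{210}{210-200} = \binom{210}{10}.$$
$$\binom{10}{7} = \binom{10}{10-7} = \binom{10}{3} = \frac{(10)(9)(8)}{3!} = \frac{(10)(9)(8)}{(3)(2)(1)} = 120.$$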




Note that for the definitions of the numbers of permutations and combinations to hold 
both n and r must be non-negative integers, 0,1,2,.... When evaluating the number 
of permutations or the number of combinations in a given situation, do not use the 
representations in terms of factorials, use the basic definitions, that is, 


$$P(n,r) = n(n-1)\cdots(n-r+1) \quad \text{and} \quad \binom{n}{r} = \frac{P(n,r)}{r!} = \frac{n(n-1)\cdots(n-r+1)}{r!}.$$


The representations in terms of factorials are useful for theoretical developments.
When factorials of large numbers are evaluated in fixed-precision arithmetic, the computer
may not give you the correct value. It may return the maximum number that
the machine can handle, which need not be equal to the value of that factorial.
Hence if such a large factorial is divided by another big factorial, different from
the numerator factorial, the computer may give the value as 1.
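
The advice to work from the basic definition can be illustrated with a short sketch. The function `binomial` below is our own illustrative helper; it builds the value one factor at a time, so no large factorial is ever formed. Python integers are exact, so the overflow problem is shown here only for double-precision floating point, which is an assumption about the arithmetic a reader might otherwise be using.

```python
import math

def binomial(n, r):
    """Combinations of n taken r at a time from n(n-1)...(n-r+1)/r!,
    built up one factor at a time."""
    if not (0 <= r <= n):
        raise ValueError("need integers with 0 <= r <= n")
    r = min(r, n - r)            # same value for r and n - r
    result = 1
    for i in range(r):
        # After this step result is the number of combinations of n
        # taken i + 1 at a time, so the integer division is exact.
        result = result * (n - i) // (i + 1)
    return result

print(binomial(100, 98))
print(binomial(210, 200) == math.comb(210, 10))

# 200! is far beyond the largest double-precision number.
try:
    float(math.factorial(200))
except OverflowError as err:
    print("overflow:", err)
```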


Example 2.3. A box contains 7 identical marbles, except for color, of which 4 are red
and 3 are green. Two marbles are selected at random (a) one by one with replacement;
(b) one by one without replacement; (c) two marbles together. (i) Compute the
numbers of sample points in these cases; (ii) compute the probabilities of getting the
sequence RG (R = red, G = green) in (a) and (b); (iii) compute the probabilities of
getting exactly one red and one green marble in (a), (b) and (c).


Solution 2.3. (a) It is like filling two positions with 7 objects. The first position can be
filled in 7 ways, and since the first marble is put back, the second position can also be
filled in 7 ways, and hence the total number of sample points is $7^2 = 49$ and the sample
space consists of all such 49 pairs of marbles.

(b) Here, the sampling is done without replacement and hence the first position
can be filled in 7 ways and the second in 6 ways because the first marble is not put
back. Hence the total number of sample points here is 7 × 6 = 42 and the sample space
consists of all such 42 pairs of marbles.

(c) Here, we are looking at all possible subsets of 2 items from a set of 7 items, and
hence the sample space consists of all such subsets of 2 items and the total number of
sample points is


$$\binom{7}{2} = \frac{(7)(6)}{2!} = \frac{42}{2} = 21.$$


In order to compute the probabilities, we will assume symmetry in the outcomes because
of the phrase "at random". Hence in (a) all the elementary events get the probabilities
of $\frac{1}{49}$ each, in (b) $\frac{1}{42}$ each and in (c) $\frac{1}{21}$ each. Now we need to compute only
how many sample points are favorable to the events.

(ii) If the first marble has to be red, then that can only come from the set of red
marbles, and hence there are 4 choices to fill the first position and similarly there are
3 choices to fill the second position and the number of sample points favorable to the
events in (a) and (b) is 4 × 3 = 12. Hence the required probabilities in (a) and (b) are the
following:


$$\frac{12}{49} \text{ for (a)} \quad \text{and} \quad \frac{12}{42} = \frac{2}{7} \text{ for (b)}.$$


For answering (iii), one has to look into all possible sequences of getting exactly one
red and one green in (a) and (b) and all subsets containing exactly one red and one
green in (c). Exactly one red and one green can come from the two sequences RG and
GR and the number of sample points favorable to the event in (a) and (b) is 4 × 3 = 12
plus 3 × 4 = 12, equal to 24. Hence the required probabilities in (a) and (b) are the
following:

$$\frac{24}{49} \text{ for (a)} \quad \text{and} \quad \frac{24}{42} = \frac{4}{7} \text{ for (b)}.$$


In (c), the total number of sample points favorable to the event of getting exactly one
red and one green marble is the following: One red can come only from the set of red
marbles and this can be done in $\binom{4}{1} = 4$ ways and similarly the one green can come
in $\binom{3}{1} = 3$ ways. Thus the total number of sample points favorable to the event is 12.
Hence the required probability of getting exactly one red and one green marble is


$$\frac{12}{21} = \frac{4}{7}.$$
Observe that sampling without replacement and taking a subset of two produced the
same result. This, in fact, is a general property.
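
The three sample spaces of Example 2.3 are small enough to enumerate, which gives an independent check of the counts and probabilities. In this sketch the marble labels `R1`, ..., `G3` and the helper names `colors` and `prob` are our own; the labels only serve to make the 7 marbles distinguishable.

```python
from fractions import Fraction
from itertools import combinations, permutations, product

marbles = ["R1", "R2", "R3", "R4", "G1", "G2", "G3"]   # illustrative labels

def colors(outcome):
    """Color pattern of an outcome, e.g. ('R1', 'G2') -> 'RG'."""
    return "".join(m[0] for m in outcome)

spaces = {
    "a": list(product(marbles, repeat=2)),     # one by one, with replacement
    "b": list(permutations(marbles, 2)),       # one by one, without replacement
    "c": list(combinations(marbles, 2)),       # two together (subsets)
}

def prob(space, event):
    """Equally likely sample points: favorable count over total count."""
    favorable = sum(1 for outcome in space if event(outcome))
    return Fraction(favorable, len(space))

for key, space in spaces.items():
    one_each = prob(space, lambda o: sorted(colors(o)) == ["G", "R"])
    print(key, len(space), one_each)

print(prob(spaces["a"], lambda o: colors(o) == "RG"),
      prob(spaces["b"], lambda o: colors(o) == "RG"))
```

Cases (b) and (c) should give the same probability of one red and one green, in line with the observation closing Solution 2.3.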


Exercises 2.3 


2.3.1. From a deck of 52 playing cards (13 diamonds, 13 spades, 13 clubs, 13 hearts) 
a hand of 8 cards is to be taken. (a) How many possibilities are there in making this 
hand of 8? (b) How many possibilities are there in making a hand of 8 consisting of 5 
spades and 3 clubs? 


2.3.2. A committee of 5 people is to be formed consisting of 3 women and 2 men. There 
are 10 men and 5 women available for selection. In how many ways can this committee 
be formed? 


2.3.3. The 6/36 lottery is where there are 36 specific numbers and 6 numbers will be 
selected at random one-by-one without replacement or a subset of 6 numbers from 
the set of 36 numbers is taken. How many points are there in the sample space? 


2.3.4. The 7/49 lottery consists of 49 specific numbers and a subset of 7 is taken, either
together or one-by-one without replacement. (a) How many sample points are there in 
this experiment? (b) If someone wishes to buy one lottery ticket to play 6/36 or 7/49, 
should she buy a ticket from 6/36 or 7/49 and why? 




2.3.5. Show that 


(1): Σ H =16; (2): Σ G) £39) (3): » | -- 25, 


2.3.6. Show that 
SOE) 
ολ. 


2.3.7. Suppose there are r indistinguishable balls and n boxes. Balls are put into the
boxes without any restriction. A box may receive none, one or more balls. In how
many ways can r indistinguishable balls be distributed into n boxes? Show that
it is $\binom{n+r-1}{r}$.


2.3.8. Compute the combinations in Exercise 2.3.7 for (i) r = 2, n = 3; (ii) r = 4, n = 3;
(iii) Verify the results in (i) and (ii) by the actual count. 


2.3.9. Evaluate the following sum: 


2.4 Sum Σ and product Π notation

Notation 2.5. Σ: notation for a sum.

Definition 2.5. $\sum_{j=1}^{n} a_j = a_1 + a_2 + \cdots + a_n$.

The standard notation used for a sum is Σ (the Greek capital letter sigma).
For example, if x is any element of the set {2, −1, 0, 5}, then


$$\sum x = \text{sum of all elements in the set} = (2) + (-1) + (0) + (5) = 6.$$


If $a_1 = 50$ kg, $a_2 = 45$ kg, $a_3 = 40$ kg, $a_4 = 55$ kg, $a_5 = 40$ kg denote the weights in
kilograms of five sacks of potatoes, then the total weight of all the five sacks will be the sum,
which can be written as


$$\sum_{j=1}^{5} a_j = a_1 + a_2 + a_3 + a_4 + a_5 = 50 + 45 + 40 + 55 + 40 = 230 \text{ kg}.$$




Here, the notation $\sum a_j$ means write the first number, which is $a_1$ or $a_j$ for j = 1, add to
it the number for j = 2, and so on until the last number. Thus


$$\sum_{j=1}^{n} a_j = \sum_{i=1}^{n} a_i = \sum_{k=1}^{n} a_k;$$

the subscript can be denoted by any symbol i, j, k, etc. because in the notation for the
sum, called the sigma notation, the subscript is replaced by 1, 2, ... and the successive
numbers are added up. If $c_1$ = Rs 100, $c_2$ = Rs 250, $c_3$ = Rs 150 are the costs of three
items bought by a shopper then the total cost is


$$\sum_{j=1}^{3} c_j = c_1 + c_2 + c_3 = 100 + 250 + 150 = \text{Rs } 500.$$

If the shopper bought 4 items, all of the same price Rs 50, then as per our notation


$$\sum_{j=1}^{4} 50 = 50 + 50 + 50 + 50 = 4 \times 50 = \text{Rs } 200.$$
Thus one property is obvious. If c is a constant, then
$$\sum_{j=1}^{n} c = n \times c = nc. \tag{2.11}$$
Suppose that on the first day a person spent $a_1$ = Rs 20 for breakfast and $b_1$ = Rs 35 for
lunch. On the second day, he spent $a_2$ = Rs 25 for breakfast and $b_2$ = Rs 30 for lunch.
Then the total amount spent for the two days is given by


$$\sum_{i=1}^{2} (a_i + b_i) = (a_1 + b_1) + (a_2 + b_2) = \sum_{i=1}^{2} a_i + \sum_{i=1}^{2} b_i = (20 + 25) + (35 + 30) = \text{Rs } 110.$$


Hence another general property is obvious:


$$\sum_{j=1}^{n} (a_j + b_j) = \sum_{j=1}^{n} a_j + \sum_{j=1}^{n} b_j. \tag{2.12}$$


Another property is the following:


$$\sum_{j=1}^{k} c\,a_j = c\sum_{j=1}^{k} a_j = c(a_1 + \cdots + a_k); \quad \sum_{j=1}^{k} (c\,a_j + d\,b_j) = c\sum_{j=1}^{k} a_j + d\sum_{j=1}^{k} b_j \tag{2.13}$$


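where c and d are constants, free of j. The average of a set of numbers $x_1, \ldots, x_n$,
denoted by $\bar{x}$, can be written as

$$\bar{x} = \frac{x_1 + \cdots + x_n}{n} = \frac{1}{n}\sum_{j=1}^{n} x_j. \tag{2.14}$$

For example, if the numbers are 2, −3, 5 then as per our notation

$$x_1 = 2, \quad x_2 = -3, \quad x_3 = 5, \quad n = 3 \quad \text{and} \quad \bar{x} = \frac{(2) + (-3) + (5)}{3} = \frac{4}{3}.$$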

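Let us see what happens if we consider $\sum_{j=1}^{3} (x_j - \bar{x})$ here:


$$\sum_{j=1}^{3} (x_j - \bar{x}) = (x_1 - \bar{x}) + (x_2 - \bar{x}) + (x_3 - \bar{x}) = \left(2 - \tfrac{4}{3}\right) + \left(-3 - \tfrac{4}{3}\right) + \left(5 - \tfrac{4}{3}\right)$$

by adding up all terms by putting j = 1, j = 2, ... and the sum is


$$= [(2) + (-3) + (5)] - 3\left(\tfrac{4}{3}\right) = 4 - 4 = 0.$$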


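This, in fact, is a general property. Whatever be the numbers $x_1, x_2, \ldots, x_n$:


$$\sum_{j=1}^{n} (x_j - \bar{x}) = \sum_{j=1}^{n} x_j - \sum_{j=1}^{n} \bar{x} = \sum_{j=1}^{n} x_j - \sum_{j=1}^{n} x_j = 0 \tag{2.15}$$

since

$$\sum_{j=1}^{n} \bar{x} = n\bar{x} = n\,\frac{(x_1 + \cdots + x_n)}{n} = \sum_{j=1}^{n} x_j.$$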
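Whatever be the numbers $x_1, x_2, \ldots, x_n$,
$$\sum_{j=1}^{n} x_j = x_1 + x_2 + \cdots + x_n \tag{2.16}$$

$$\left(\sum_{j=1}^{n} x_j\right)^2 = (x_1 + \cdots + x_n)^2 = x_1^2 + \cdots + x_n^2 + 2x_1x_2 + \cdots + 2x_1x_n + 2x_2x_3 + \cdots + 2x_2x_n + \cdots + 2x_{n-1}x_n$$

$$= \sum_{j=1}^{n} x_j^2 + 2\sum_{i<j} x_i x_j = \sum_{j=1}^{n} x_j^2 + 2\sum_{i>j} x_i x_j$$

$$= \sum_{i=1}^{n}\sum_{j=1}^{n} x_i x_j. \tag{2.17}$$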
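
The properties of deviations from the average and of the square of a sum can be verified on the numbers 2, −3, 5 used earlier. The following is a small sketch with exact rational arithmetic; all variable names are ours.

```python
from fractions import Fraction
from itertools import combinations, product

x = [2, -3, 5]
n = len(x)
mean = Fraction(sum(x), n)                       # the average, kept exact

deviation_sum = sum(xj - mean for xj in x)       # sum of (x_j - mean)

square_of_sum = sum(x) ** 2
squares = sum(xj ** 2 for xj in x)
cross = sum(xi * xj for xi, xj in combinations(x, 2))          # pairs with i < j
double_sum = sum(xi * xj for xi, xj in product(x, repeat=2))   # all pairs (i, j)

print(mean, deviation_sum)
print(square_of_sum, squares + 2 * cross, double_sum)
```

The three numbers on the last line should coincide, one for each way of writing the square of the sum.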


For example, the sum $\sum_{i<j} x_i x_j$ means to take the sum of products of all terms where the
first subscript is less than the second subscript or i < j. It is a double sum involving i
and j but subject to the condition i < j. That is, for example,


$$(x_1 + x_2 + x_3)^2 = x_1^2 + x_2^2 + x_3^2 + 2x_1x_2 + 2x_1x_3 + 2x_2x_3$$


which is the same as saying


$$(x_1 + x_2 + x_3)^2 = \sum_{j=1}^{3} x_j^2 + 2\sum_{i<j} x_i x_j = \sum_{j=1}^{3} x_j^2 + 2\sum_{i>j} x_i x_j$$

$$= x_1^2 + x_2^2 + x_3^2 + 2x_2x_1 + 2x_3x_1 + 2x_3x_2$$

$$= x_1x_1 + x_2x_2 + x_3x_3 + x_1x_2 + x_2x_1 + x_1x_3 + x_3x_1 + x_2x_3 + x_3x_2$$


which is the same as saying the double sum without any restriction on i and j, that is,
$\sum_{i=1}^{3}\sum_{j=1}^{3} (x_i x_j)$. Some of the general properties of the sigma notation are the following:
For any set of numbers $a_1, a_2, \ldots, b_1, b_2, \ldots$


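$$\sum_{i=1}^{n} (a_i b_i) = a_1b_1 + a_2b_2 + \cdots + a_nb_n \tag{2.18}$$

$$\sum_{i=1}^{m}\sum_{j=1}^{n} a_i b_j = [a_1 + \cdots + a_m][b_1 + \cdots + b_n] = [b_1 + \cdots + b_n][a_1 + \cdots + a_m] = \sum_{j=1}^{n}\sum_{i=1}^{m} a_i b_j \tag{2.19}$$


or, in other words, we could have opened up the sum with respect to i first or j first; the
result would have remained the same.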


If we have two or more subscripts, the sigma notation will be the same type. For
example, let $w_{i,j}$ (which is also written as $w_{ij}$ without the comma between the
subscripts if there is no possibility of confusion) be the weight of the i-th individual in the
j-th age group. Suppose we have numbered some people from 1 to 40, say, i = 1, ..., 40
and categorized into 5 categories according to their ages such as age less than or equal
to 20 in group 1, greater than 20 but less than or equal to 30 in group 2, greater than
30 but less than or equal to 40 in group 3, greater than 40 but less than or equal to 50
in group 4, greater than 50 in group 5. Then j = 1, 2, 3, 4, 5. We have $w_{10,5}$ the weight of
the 10th person in the 5th age group, and if her weight is 55 kg then $w_{10,5} = 55$. Then
the total weight of all the individuals is given by

$$\sum_{i=1}^{40}\sum_{j=1}^{5} w_{ij} = \sum_{j=1}^{5}\sum_{i=1}^{40} w_{ij},$$

that is, we can sum up i first or j first and it is also
$$= w_{11} + w_{12} + \cdots + w_{15} + w_{21} + \cdots + w_{40,4} + w_{40,5}$$
$$= w_{11} + w_{21} + \cdots + w_{40,1} + w_{12} + \cdots + w_{40,2} + \cdots + w_{40,5}.$$
Thus we have the following general rule:


$$\sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij} = \sum_{i=1}^{m}\left(\sum_{j=1}^{n} a_{ij}\right) = \sum_{j=1}^{n}\left(\sum_{i=1}^{m} a_{ij}\right). \tag{2.20}$$
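
The freedom to sum over either subscript first can be seen on a small array. The values below are arbitrary illustrative numbers, not data from the text, and the names `a`, `p`, `q` are ours.

```python
a = [[3, 1, 4],
     [1, 5, 9]]                      # illustrative a_ij with m = 2 rows, n = 3 columns
m, n = len(a), len(a[0])

# Inner sum over j (across a row), then outer sum over i.
j_inside = sum(sum(a[i][j] for j in range(n)) for i in range(m))
# Inner sum over i (down a column), then outer sum over j.
i_inside = sum(sum(a[i][j] for i in range(m)) for j in range(n))

# Special case a_ij = p_i * q_j: the double sum factors into a product of two sums.
p, q = [1, 2], [3, 4, 5]
double = sum(pi * qj for pi in p for qj in q)

print(j_inside, i_inside)
print(double, sum(p) * sum(q))
```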


2.4.1 The product notation or pi notation 


Just like the notation for a sum, calling it the sigma notation, we have a notation for a 
product, calling it the pi notation. 


Notation 2.6. Π: the product notation.


Definition 2.6.


$$\prod_{j=1}^{n} a_j = a_1 \times a_2 \times \cdots \times a_n = a_1a_2\cdots a_n.$$


For example, if $a_1 = 5$, $a_2 = -1$, $a_3 = 2$ then


$$\prod_{j=1}^{3} a_j = a_1a_2a_3 = (5)(-1)(2) = -10.$$


If $a_1 = a_2 = a_3 = 5$, then $\prod_{j=1}^{3} a_j = (5)(5)(5) = 5^3 = 125$. Thus, in general we have the
following result:


$$\prod_{j=1}^{n} c = c^n \tag{2.21}$$
whenever c is a constant.


$$\prod_{j=1}^{n} (a_j + b_j) = (a_1 + b_1)(a_2 + b_2)\cdots(a_n + b_n) \neq \prod_{j=1}^{n} a_j + \prod_{j=1}^{n} b_j. \tag{2.22}$$


(ο an 


But if the brackets are not there, then let us see what happens. 




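$$\prod_{i=1}^{2} a_i \prod_{j=1}^{3} b_j = \left(a_1 \prod_{j=1}^{3} b_j\right)\left(a_2 \prod_{j=1}^{3} b_j\right) = a_1a_2\left(\prod_{j=1}^{3} b_j\right)^2$$


opening up i first, and it is also $= a_1a_2(b_1b_2b_3)^2$. Let us see what happens if we open
up j first.


$$\prod_{i=1}^{2} a_i \prod_{j=1}^{3} b_j = \left(\prod_{i=1}^{2} a_i\, b_1\right)\left(\prod_{i=1}^{2} a_i\, b_2\right)\left(\prod_{i=1}^{2} a_i\, b_3\right) = b_1b_2b_3\left(\prod_{i=1}^{2} a_i\right)^3 = b_1b_2b_3(a_1a_2)^3.$$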


Hence it is clear that if product notations are written without brackets then a product 
need not determine a unique quantity, or the notation becomes meaningless. Hence 
remember to put proper brackets at appropriate places, otherwise the notation can 
become meaningless: 


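$$\prod_{i=1}^{2} (a_i - c) = (a_1 - c)(a_2 - c) = a_1a_2 - c(a_1 + a_2) + c^2 \neq \prod_{i=1}^{2} a_i - \prod_{i=1}^{2} c = a_1a_2 - c^2.$$


$$\prod_{j=1}^{n} (x - a_j) = (x - a_1)(x - a_2)\cdots(x - a_n). \tag{2.24}$$


$$\prod_{j=1}^{n} [c\,a_j] = (ca_1)(ca_2)\cdots(ca_n) = c^n a_1 \cdots a_n \neq c\prod_{j=1}^{n} a_j.$$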
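
The warnings about brackets in the product notation can be made concrete with numbers. This sketch uses `math.prod` (available from Python 3.8) and arbitrary illustrative values for `a`, `b` and `c`; none of these values come from the text.

```python
from math import prod

a = [1, 2, 3]
b = [4, 5, 6]
c = 2
n = len(a)

# Product of sums versus sum of products: these differ in general.
product_of_sums = prod(aj + bj for aj, bj in zip(a, b))
sum_of_products = prod(a) + prod(b)

# A constant inside the product comes out as c**n, not as c.
constant_inside = prod(c * aj for aj in a)

print(product_of_sums, sum_of_products)
print(constant_inside, c ** n * prod(a), c * prod(a))
print(prod(c for _ in range(n)), c ** n)
```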


Exercises 2.4 
2.4.1. If x € {3,5, 0} and y € {-2,—5, 0,6}, then compute (i) Σ x; (ii) Y y; (iii) Y (x t+ y). 


2.4.2. If X 2 2, x; 2 3, X = 0, x, = 5, then compute 
(Ὁ x; 
(1) Dhi- 95 
(iii) ee 
(iv) X cas -Xy 
(v) PM |x;| (absolute value means the magnitude without the sign, when χι is real, 
that is |6| = 6, |-6| = 6, |0|= 0, l-41 - 4, |-3?] = 3? - 9); 
ΕΓ 4 = 
(vi) Xi bx - xl. 


2.4.3. For general numbers x,, ...,x,, derive the following general results: 


n n n 1 n 2 
ο no? - Y - (x); 
ΕΙ ja ja ja 
1 


bye | *Y05-2; (225) 
jal 




[Give counter examples wherever something is to be disproved.] 


2.4.4. Evaluate the following: 

G) IT2,(0); 

(ii) TI}; (a; - 3) where αι 25, a; = 0, a; = 22; 

(iii) [T2., (a; -- bj) where (a,,b,) = (2,3), (a, δ) = (5,1), (a3, b3) = (1, 2), and show that 
it is not equal to [T? , a; - [7.4 δι. 


2.4.5. Write the following by using a double product notation: 
(a; - a5)(d, — a3) ++ (a, - a)(a5 - d3) +++ (42 - An) + (an. = An). 


2.4.6. Open up the following and write as a sum: 
(i) (x-aj)(x -ax - a3); 
{9 (x-a4)(x- a3) --- (x - αι). 


2.4.7. Open up the following: 
G (QIa)0 5): 

Gi) (ΠΕ a) Σι b; 

(1) TT, a4 bj 

Gv) TT? a; Nui bj. 


2.4.8. Evaluate the following: 
(i) It (αι τ b;)]; 
(1) TMi: - δῃ]. 


2.4.9. Ifa, =2, a; = -1then evaluate 
© Ti αρ” and 
(i) Ti-a. 


2.4.10. For paired values $(x_1, y_1), \ldots, (x_n, y_n)$ show that


$$\sum_{j=1}^{n} (x_j - \bar{x})(y_j - \bar{y}) = \sum_{j=1}^{n} x_j y_j - n\bar{x}\bar{y}.$$


Example 2.4. In Example 2.3, suppose that the marbles are taken at random, one by 
one, without replacement. What is the probability that (a) the second marble taken is 
green, given that the first marble removed is a red marble? (b) the second marble is 
green? 




Solution 2.4. Let A be the event that the first marble removed is red and let B be the
event that the second marble is green. (a) If it is already known that the first marble
removed is red then there are only 6 marbles left in the box, out of which 3 are green,
and hence the required probability is $\frac{3}{6} = \frac{1}{2}$. That is,


$$P(B \text{ given that } A \text{ has occurred}) = \frac{1}{2}.$$


What is $A \cap B$ here? This is the event that the first marble is red and the second marble is
green or getting the sequence RG. This probability is already evaluated in Example 2.3.
That is,
$$P(A \cap B) = \frac{12}{42} = \frac{2}{7}.$$

What is the probability P(A) that the first marble is red? This can be computed either
by looking at the first trial alone, where there are 7 marbles out of which 4 are red,
and hence the probability is $P(A) = \frac{4}{7}$. We can also look at it after the completion of
the experiment of taking two marbles one by one without replacement. Then the first
marble is red if we have the sequence RG or RR. The total number of points favorable
to this event is RG giving 4 × 3 = 12 plus RR giving 4 × 3 = 12, and hence the probability


$$P(A) = \frac{12 + 12}{42} = \frac{4}{7}.$$
One interesting property may be noted from the above. That is,
$$P(A \cap B) = \frac{12}{42} = P(A)P(B \text{ given that } A \text{ has occurred}) = \frac{4}{7} \times \frac{3}{6}.$$


This is a general property that we shall discuss next, after checking (b).

(b) Here, we need the probability for the second marble to be green. This can
happen in two ways, under the sequence RG or GG. The sample points favorable to
these two sequences is 4 × 3 = 12 plus 3 × 2 = 6 or 18. Hence

$$P(B) = \frac{18}{42} = \frac{3}{7}.$$
It is equivalent to taking one marble at random and the probability for that marble
being green.
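
The probabilities in Solution 2.4 can be confirmed by listing the 42 ordered pairs. In this sketch the list `marbles` holds one entry per marble, so `permutations` treats marbles of the same color as distinct objects; the helper `prob` and the variable names are ours.

```python
from fractions import Fraction
from itertools import permutations

marbles = ["R"] * 4 + ["G"] * 3
space = list(permutations(marbles, 2))   # 7 * 6 = 42 equally likely ordered pairs

def prob(event):
    return Fraction(sum(1 for o in space if event(o)), len(space))

p_a = prob(lambda o: o[0] == "R")          # first marble red
p_b = prob(lambda o: o[1] == "G")          # second marble green
p_ab = prob(lambda o: o == ("R", "G"))     # the sequence RG
p_b_given_a = p_ab / p_a                   # ratio of the two probabilities

print(p_a, p_b, p_ab, p_b_given_a)
```

The last value should equal the direct count made in part (a): 3 green marbles among the 6 that remain.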


2.5 Conditional probabilities 


We will examine probability of the type B given A or the probability of an event given 
that some other event has occurred. In some cases that information will change the 
probability, that is, the probability of a conditional statement and that of an uncon- 
ditional statement may be different, as seen from Example 2.4. We will introduce a 
formal notation and definition for such conditional statements here. 




Notation 2.7. P(B|A) = probability of B given A = the conditional probability of B 
given that A has already occurred, where A and B are two events in the same sample 
space. 


Here, the notation is a vertical bar after B and it should not be written as B/A or
$\frac{B}{A}$; these have no meaning when A and B are events. Conditional probability can
be defined in terms of the probability for simultaneous occurrence and the marginal
probability or the probability of the conditioned event.


Definition 2.7. The conditional probability of B given A is the probability of the
simultaneous occurrence of B and A divided by the probability of A when $P(A) \neq 0$.
That is,


$$P(B|A) = \frac{P(A \cap B)}{P(A)}, \quad P(A) \neq 0, \quad \Rightarrow \quad P(A \cap B) = P(A)P(B|A) \quad \Rightarrow$$
$$P(A \cap B) = P(A)P(B|A) = P(B)P(A|B), \quad P(A) \neq 0, \quad P(B) \neq 0. \tag{2.26}$$


Thus the probability of intersection can be written as the conditional probability times 
the marginal probability of the conditioned event. This rule can be extended to any 
number of events: 


$$P(A \cap B \cap C) = P(A | B \cap C)P(B \cap C) = P(A | B \cap C)P(B|C)P(C),$$
$$P(B \cap C) \neq 0, \quad P(C) \neq 0. \tag{2.27}$$


Extending this result, we have 


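$$P(A_1 \cap A_2 \cap A_3 \cap \cdots \cap A_k) = P(A_1 | A_2 \cap A_3 \cap \cdots \cap A_k)P(A_2 \cap \cdots \cap A_k), \quad P(A_2 \cap \cdots \cap A_k) \neq 0$$
$$= P(A_1 | A_2 \cap \cdots \cap A_k)P(A_2 | A_3 \cap \cdots \cap A_k) \times \cdots \times P(A_{k-1} | A_k)P(A_k),$$
$$P(A_2 \cap \cdots \cap A_k) \neq 0, \; \ldots, \; P(A_k) \neq 0. \tag{2.28}$$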


Example 2.5. A box contains 4 red and 3 green identical marbles. Marbles are taken at 
random one by one (a) without replacement; (b) with replacement. What is the prob- 
ability of getting (i) the sequence RRG; (ii) the sequence RGR; (iii) exactly 2 red and 
one green marble. 


Solution 2.5. (i)(a) Let the marbles be selected at random without replacement. Let
A be the event that the first marble is red, B be the event that the second marble is
red and C be the event that the third marble is green. Then the sequence RRG means
$A \cap B \cap C$. By using the rule in (2.28), we have


$$P(A \cap B \cap C) = P(C | B \cap A)P(B|A)P(A) = P(A)P(B|A)P(C | A \cap B) = \frac{4}{7} \times \frac{3}{6} \times \frac{3}{5} = \frac{6}{35}.$$

The probability of the first marble being red is $P(A) = \frac{4}{7}$ because there are 7 marbles out of which
4 are red and the marbles are picked at random. If one red marble is removed, then
the probability of getting another red marble is $P(B|A) = \frac{3}{6}$ because there are only 6
marbles left out of which 3 are red. If two red marbles are removed, then there are only
5 marbles out of which 3 are green, and hence $P(C | A \cap B) = \frac{3}{5}$. By a similar argument,
the probability in (ii)(a) is


$$P(\{RGR\}) = \frac{4}{7} \times \frac{3}{6} \times \frac{3}{5} = \frac{6}{35}.$$
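(i), (ii)(b) Let the marbles be taken with replacement. If marbles are returned each
time, then the probability remains the same. Then the probability of getting a red in any
trial is $\frac{4}{7}$ and the probability of getting a green in any trial is $\frac{3}{7}$. The occurrence or
non-occurrence of an event in the first trial does not affect the probability of occurrence of
an event in the second trial, and so on. Again, by using the same formula (2.27), we have


$$P(RRG) = \frac{4}{7} \times \frac{4}{7} \times \frac{3}{7} = \left(\frac{4}{7}\right)^2\left(\frac{3}{7}\right); \quad P(RGR) = \frac{4}{7} \times \frac{3}{7} \times \frac{4}{7} = \left(\frac{4}{7}\right)^2\left(\frac{3}{7}\right).$$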


(iii) Note that exactly 2 red and one green, out of three marbles taken, can come in

$$\binom{3}{2} = 3$$

ways. These are the sequences RRG, RGR, GRR. By using (2.27), we see that for each of
these sequences the probability remains the same. Hence the probabilities for (iii)(a)
and (iii)(b) are respectively,

$$3 \times \frac{6}{35} = \frac{18}{35} \quad\text{and}\quad 3 \times \left(\frac{4}{7}\right)^2\left(\frac{3}{7}\right).$$


Example 2.6. In Example 2.5, suppose that three marbles are taken together at random.
What is the probability of getting exactly 2 red marbles and one green marble?


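Solution 2.6. This is a matter of selecting subsets of size 3 or of 3 elements. The total
number of sample points possible is

$$\binom{7}{3} = \frac{7 \times 6 \times 5}{1 \times 2 \times 3} = 35.$$

The total number of sample points favorable to the event is

$$\binom{4}{2}\binom{3}{1} = \left(\frac{4 \times 3}{1 \times 2}\right)(3) = 18$$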




because the red marbles can come only from the set of red marbles and there are 4
red and a subset of 2 is taken, which can be done in $\binom{4}{2}$ ways and similarly the green
marbles can be selected in $\binom{3}{1}$ ways. Note that for each selection of red, the green can
be selected in $\binom{3}{1}$ ways and vice versa, and hence the total number of sample points
favorable to the event is the product of the two combinations. Hence the required
probability in (iii)(a) is

$$\frac{\binom{4}{2}\binom{3}{1}}{\binom{7}{3}} = \frac{18}{35}.$$

When sampling is done with replacement then the probability of getting a red marble
at any trial is $\frac{4}{7}$ and the probability of getting a green marble at any trial is $\frac{3}{7}$. The total
number of ways of getting 2 red and 1 green in 3 trials is $\binom{3}{2} = \binom{3}{1}$. Hence the answer for
(iii)(b) is

$$\binom{3}{2}\left(\frac{4}{7}\right)^2\left(\frac{3}{7}\right).$$
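The counting arguments of Examples 2.5 (iii) and 2.6 can be checked by brute-force enumeration. The following Python sketch is only an illustration and is not part of the original solution; the names `marbles` and `is_two_red_one_green` are chosen here for convenience. It lists all equally likely selections and counts the favorable ones.

```python
from fractions import Fraction
from itertools import combinations, product

marbles = ["R"] * 4 + ["G"] * 3  # 4 red and 3 green marbles, as in Example 2.5

def is_two_red_one_green(draw):
    """True when the three drawn colours are exactly two R and one G."""
    return sorted(draw) == ["G", "R", "R"]

# Three marbles taken together: every 3-subset of the 7 marbles is equally likely.
subsets = list(combinations(range(7), 3))
favorable = [s for s in subsets if is_two_red_one_green([marbles[i] for i in s])]
p_together = Fraction(len(favorable), len(subsets))

# Sampling with replacement: every ordered triple of marbles is equally likely.
triples = list(product(range(7), repeat=3))
hits = [t for t in triples if is_two_red_one_green([marbles[i] for i in t])]
p_with_replacement = Fraction(len(hits), len(triples))

print(len(subsets), len(favorable), p_together)  # 35 18 18/35
print(p_with_replacement)                        # 144/343
```

The value 144/343 is the same as $3 \times (\frac{4}{7})^2(\frac{3}{7})$ obtained above.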


Definition 2.8 (Statistical independence or product probability property (PPP)). If

$$P(A \cap B) = P(A)P(B) \tag{2.29}$$

then the events A and B are said to be independent events or said to satisfy the
product probability property. If three events A, B, C are such that

$$P(A \cap B) = P(A)P(B), \quad P(A \cap C) = P(A)P(C), \quad P(B \cap C) = P(B)P(C) \tag{2.30}$$

then A, B, C are said to be pairwise independent events. If, in addition to (2.30),

$$P(A \cap B \cap C) = P(A)P(B)P(C)$$

then the events A, B, C are said to be mutually independent events. Pairwise independence
need not imply mutual independence. A set of events $A_1, A_2, \ldots, A_k$ are said
to be mutually independent events if for all subsets of the set $\{A_1, \ldots, A_k\}$ the product
probability property holds or the probability of the intersection is the product
of the probabilities of individual events, that is,

$$P(A_{i_1} \cap \cdots \cap A_{i_r}) = P(A_{i_1}) \cdots P(A_{i_r}) \tag{2.31}$$

for all different subscripts $(i_1, \ldots, i_r)$, $r = 2, \ldots, k$. This means for every intersection
of two, three, ..., k distinct events the probability of the intersection is the product
of the individual probabilities.


From the following figure, it can be seen that pair-wise independence need not 
imply mutual independence. 




Figure 2.1: Pairwise and mutual independence. 


In Figure 2.1 (a), a sample space with symmetry in the outcomes and with 20 sample
points, and three events A, B, C, is given. The numbers in the various regions indicate the
numbers of points falling in those regions. Each of the 20 sample points has a
probability of $\frac{1}{20}$. A total of 10 sample points fall in each of A, B and C. Five points
each fall in the intersections $A \cap B$, $A \cap C$, $B \cap C$, three sample points fall in $A \cap B \cap C$
and two sample points are in the complementary region of $A \cup B \cup C$. Note that

$$\begin{aligned}
P(A) &= \frac{10}{20} = \frac{1}{2} = P(B) = P(C); \\
P(A \cap B) &= \frac{5}{20} = \frac{1}{4} = P(A)P(B); \\
P(A \cap C) &= \frac{5}{20} = \frac{1}{4} = P(A)P(C); \\
P(B \cap C) &= \frac{5}{20} = \frac{1}{4} = P(B)P(C); \\
P(A \cap B \cap C) &= \frac{3}{20} \neq P(A)P(B)P(C) = \frac{1}{8}.
\end{aligned}$$


Hence A, B, C are pair-wise independent but not mutually independent. 
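The same conclusion can be reached by building a sample space with the counts quoted for Figure 2.1 (a) and testing the product probability property directly. The sketch below is an illustration only; the split of the 20 points into regions (for instance 5 - 3 = 2 points in $A \cap B$ outside C) is derived from those counts, and the labels 0, ..., 19 for the sample points are an arbitrary choice.

```python
from fractions import Fraction

# Region sizes implied by the counts for Figure 2.1 (a): 10 points in each event,
# 5 in each pairwise intersection, 3 in the triple intersection, 2 outside.
regions = {
    ("A", "B", "C"): 3,
    ("A", "B"): 2, ("A", "C"): 2, ("B", "C"): 2,
    ("A",): 3, ("B",): 3, ("C",): 3,
    (): 2,
}

# Label the sample points 0, 1, ..., 19 and collect each event as a set of points.
events = {"A": set(), "B": set(), "C": set()}
point = 0
for members, count in regions.items():
    for _ in range(count):
        for name in members:
            events[name].add(point)
        point += 1
n_points = point

def prob(event):
    """Probability of an event when all sample points are equally likely."""
    return Fraction(len(event), n_points)

A, B, C = events["A"], events["B"], events["C"]
pairwise = all(prob(X & Y) == prob(X) * prob(Y) for X, Y in [(A, B), (A, C), (B, C)])
mutual = pairwise and prob(A & B & C) == prob(A) * prob(B) * prob(C)
print(pairwise, mutual)  # True False
```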

Some students may think that $P(A \cap B \cap C) = P(A)P(B)P(C)$ is sufficient to
guarantee mutual independence. This is not sufficient. In Figure 2.1 (b), let us assume
symmetry and let the numbers of elementary events be as shown there in the sets
A, B, C: 6 in A, 6 in B, 4 in C and one outside, thus a total of 12 points. Then

$$P(A) = \frac{6}{12} = \frac{1}{2}, \quad P(B) = \frac{6}{12} = \frac{1}{2}, \quad P(C) = \frac{4}{12} = \frac{1}{3};$$
$$P(A \cap B \cap C) = \frac{1}{12} = P(A)P(B)P(C);$$
$$P(A \cap B) \neq P(A)P(B) = \frac{1}{4},$$

where $P(A \cap B)$ is obtained from the number of points shown in $A \cap B$ in Figure 2.1 (b).
Hence $P(A \cap B \cap C) = P(A)P(B)P(C)$ need not imply $P(A \cap B) = P(A)P(B)$.


Note 2.1. Independence of events should not be confused with mutually exclusive
events. The word “independent” is one of the unfortunate terms in statistical literature.
It can create a wrong impression in the minds of students, as if the events have
nothing to do with each other or are mutually exclusive. When we say that the
events A and B are independent, they in fact depend on each other a lot; the dependence is
in the form of a product probability, that is, the probability of the intersection is the
product of the individual probabilities:

$$P(A \cap B) = P(A)P(B).$$

Hence the students may observe that A and B depend on each other through this product
probability property (PPP), and hence this author has suggested replacing “independence
of events” with “events satisfying the product probability property”. The students
must keep in mind that

independence of events has nothing to do with mutual exclusiveness of events.

Two events can be mutually exclusive and not independent, or mutually exclusive and
independent, or not mutually exclusive and independent, or not mutually exclusive and
not independent.


Now the students may wonder from where this word “independent” originated.
This has to do with conditional statements. We had defined the conditional probability of
A given B as

$$P(A|B) = \frac{P(A \cap B)}{P(B)} \quad \text{for } P(B) \neq 0.$$

Now, if the product probability property holds then $P(A \cap B) = P(A)P(B)$. Then in this
case

$$P(A|B) = \frac{P(A)P(B)}{P(B)} = P(A) \quad \text{when } P(B) \neq 0. \tag{2.32}$$

This means that the conditional probability of A given B is the same as the marginal or
unconditional probability of A when the product probability property holds. In other
words, the probability of A is not affected by the occurrence or non-occurrence of B
and in this sense, independent of the occurrence of B. This is from where the word
“independent” came in. But this word has created a lot of confusion when this concept
is applied in practical situations. Hence it is much safer to say “when the product
probability property or PPP holds” instead of saying “when there is independence”. We
have given examples of sampling with replacement where PPP holds or where the
events are independent.


2.5.1 The total probability law 


Two important results on conditional probability are the total probability law and 
Bayes' theorem. Both deal with a partitioning of the sample space. Let a sample space 




be partitioned into mutually exclusive and totally exhaustive events $A_1, \ldots, A_k$ and let
B be any other event in the same sample space as in Figure 2.2. From the Venn diagram
one may note that B is partitioned into mutually exclusive pieces $B \cap A_1, B \cap A_2, \ldots, B \cap A_k$,
some of which may be empty. That is,

$$S = A_1 \cup A_2 \cup \cdots \cup A_k, \quad A_i \cap A_j = \emptyset \text{ for all } i \neq j = 1, \ldots, k;$$
$$B = (B \cap A_1) \cup (B \cap A_2) \cup \cdots \cup (B \cap A_k), \quad (B \cap A_i) \cap (B \cap A_j) = \emptyset \text{ for all } i \neq j = 1, \ldots, k.$$


Figure 2.2: Total probability law. 


Hence by the second and third axioms of probability we have

$$P(B) = P(B \cap A_1) + P(B \cap A_2) + \cdots + P(B \cap A_k) \tag{2.33}$$

which can be written, by using conditional probability, as

$$P(B) = P(B|A_1)P(A_1) + P(B|A_2)P(A_2) + \cdots + P(B|A_k)P(A_k) \tag{2.34}$$

for $P(A_j) \neq 0$, $j = 1, \ldots, k$. This equation (2.34) is known as the total probability law
where the probability of the event B is split into a sum of products of the conditional
probabilities of B given $A_1, \ldots, A_k$ and the marginal probabilities of $A_1, \ldots, A_k$. It is a
probability law connecting conditional probabilities and marginal probabilities when the
marginal probabilities are non-zero. We can get many interesting applications of this
probability law.


Example 2.7. Dr Joy is not a very good medical practitioner. If a patient goes to him,
the chance that he will diagnose the patient's symptoms properly is 30%. Even if the
diagnosis is correct his treatment is such that the chance of the patient dying is 60%
and if the diagnosis is wrong the chance of the patient dying is 95%. What is the
probability that a patient going to Dr Joy dies during treatment?


Solution 2.7. Let $A_1$ be the event of a correct diagnosis, and $A_2$ that of a wrong
diagnosis. Then $A_1 \cap A_2 = \emptyset$, $A_1 \cup A_2 = S$, the sure event. Let B be the event of a patient of
Dr Joy dying. Then the following probabilities are given:

$$P(A_1) = 0.3, \quad P(A_2) = 0.7, \quad P(B|A_1) = 0.6, \quad P(B|A_2) = 0.95.$$

We are asked to compute the probability of B. By the total probability law,

$$P(B) = P(B|A_1)P(A_1) + P(B|A_2)P(A_2) = (0.6)(0.3) + (0.95)(0.7) = 0.845$$

or the chance of the patient dying is 84.5%.


Example 2.8. Mr Narayanan is a civil engineer with the Kerala government. He is asked
to design an over bridge (sky way). The chance that his design is going to be faulty
is 60% and the chance that his design will be correct is 40%. The chance of the over
bridge collapsing if the design is faulty is 90%; otherwise, due to other causes, the
chance of the over bridge collapsing is 20%. What is the chance that an over bridge
built by Mr Narayanan will collapse?


Solution 2.8. Let $A_1$ be the event that the design is faulty and $A_2$ be the event that the
design is not faulty. Then $A_1 \cap A_2 = \emptyset$ and $A_1 \cup A_2 = S$, a sure event. Let B be the event
of the over bridge collapsing. Then we are given the following:

$$P(A_1) = 0.6, \quad P(A_2) = 0.4, \quad P(B|A_1) = 0.9, \quad P(B|A_2) = 0.2.$$

We are asked to compute the probability of B. From the total probability law,

$$P(B) = P(B|A_1)P(A_1) + P(B|A_2)P(A_2) = (0.9)(0.6) + (0.2)(0.4) = 0.62.$$


There is a 62% chance of the over bridge designed by Mr Narayanan collapsing. 
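The computation in (2.34) is a weighted sum and is easily mechanized. The following Python sketch is an illustration only; the function name `total_probability` and the convention of passing the $P(A_j)$ and the $P(B|A_j)$ as two lists in the same order are assumptions made here. It reproduces the answers of Solutions 2.7 and 2.8.

```python
def total_probability(priors, likelihoods):
    """P(B) = sum over j of P(B|A_j) P(A_j) for a partition A_1, ..., A_k.

    priors      -- the marginal probabilities P(A_1), ..., P(A_k)
    likelihoods -- the conditional probabilities P(B|A_1), ..., P(B|A_k)
    """
    if abs(sum(priors) - 1.0) > 1e-12:
        raise ValueError("the probabilities P(A_j) of a partition must add up to 1")
    return sum(l * p for l, p in zip(likelihoods, priors))

p_death = total_probability([0.3, 0.7], [0.6, 0.95])    # Example 2.7
p_collapse = total_probability([0.6, 0.4], [0.9, 0.2])  # Example 2.8
print(round(p_death, 3), round(p_collapse, 3))  # 0.845 0.62
```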


2.5.2 Bayes’ rule 


Consider again the partitioning of the sample space into mutually exclusive and totally
exhaustive events $A_1, \ldots, A_k$ and let B be any event in the same sample space. In
the total probability law, we computed the probability of B. Let us look into any one
intersection of B with the $A_j$'s, for example, consider $B \cap A_1$. From the definition of
conditional probability, we can write

$$P(B \cap A_1) = P(A_1|B)P(B), \quad P(B) \neq 0.$$

Therefore,

$$\begin{aligned}
P(A_1|B) &= \frac{P(B \cap A_1)}{P(B)}, \quad P(B) \neq 0 \\
&= \frac{P(B|A_1)P(A_1)}{P(B)} \\
&= \frac{P(B|A_1)P(A_1)}{P(B|A_1)P(A_1) + \cdots + P(B|A_k)P(A_k)}, \quad P(A_j) \neq 0,\; j = 1, \ldots, k.
\end{aligned} \tag{2.35}$$




This equation (2.35) is known as Bayes' rule, Bayes' law or Bayes' theorem. It is named
after a Christian priest, Rev. Bayes, who discovered this rule. The beauty of the result
can be seen from many perspectives. It can be interpreted as a rule connecting prior
and posterior probabilities in the sense that probabilities of the type $P(A_j|B)$ can be
interpreted as posterior probabilities, or the probability of the event $A_j$ computed after
observing the event B, and $P(A_j)$ can be called the prior probability of $A_j$, or the
probability of $A_j$ computed before observing the event B. Bayes' rule also provides an inverse
reasoning or establishes a connection between probabilities of the type $P(A_1|B)$ and
$P(B|A_1)$, where one can be interpreted as the probability from cause to effect and the
other from effect to cause.

If a patient died and if the relatives of the patient felt that the medical doctor at- 
tending to the patient was incompetent or the hospital was negligent, then they would 
like to have an estimate of the chance that the patient died due to the negligence or 
incompetence of the doctor, etc. What is the probability that the diagnosis was wrong 
given that the patient died? In the case of a bridge collapsing, the concerned general 
public may want to know the chance that the engineer's design was in fact faulty in 
the light of the bridge collapsing. 


Example 2.9. In Example 2.7, what is the probability that Dr Joy's diagnosis was
wrong in the light of a patient of Dr Joy dying?


Solution 2.9. Here, we are asked to compute the probability $P(A_2|B)$. But

$$\begin{aligned}
P(A_2|B) &= \frac{P(B|A_2)P(A_2)}{P(B)} = \frac{P(B|A_2)P(A_2)}{P(B|A_1)P(A_1) + P(B|A_2)P(A_2)} \\
&= \frac{(0.95)(0.7)}{0.845} = \frac{0.665}{0.845} = \frac{133}{169} \approx 0.787.
\end{aligned}$$

There is approximately a 78.7% chance that the doctor's diagnosis was wrong. There
is a very good chance of a successful lawsuit against the doctor.


Example 2.10. In Example 2.8, what is the probability that the design of Mr Naraya- 
nan was faulty, in the light of an over bridge designed by him collapsing? 


Solution 2.10. Here, we are asked to compute the probability $P(A_1|B)$. From Bayes'
rule, we have

$$\begin{aligned}
P(A_1|B) &= \frac{P(B|A_1)P(A_1)}{P(B)} = \frac{P(B|A_1)P(A_1)}{P(B|A_1)P(A_1) + P(B|A_2)P(A_2)} \\
&= \frac{(0.9)(0.6)}{0.62} = \frac{54}{62} \approx 0.87.
\end{aligned}$$


There is approximately an 87% chance that Mr Narayanan’s design was faulty. But there
is no way of taking any action against Mr Narayanan due to job security in the Kerala
system even if several people died due to the collapse of the over bridge.
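Bayes' rule (2.35) divides each product $P(B|A_j)P(A_j)$ by their sum, so all posterior probabilities can be obtained in one pass. The Python sketch below is an illustration under the same list convention as before (the $P(A_j)$ and the $P(B|A_j)$ listed in the same order); the function name `posterior` is an assumption made here.

```python
def posterior(priors, likelihoods):
    """Return [P(A_1|B), ..., P(A_k|B)] computed by Bayes' rule (2.35)."""
    joint = [l * p for l, p in zip(likelihoods, priors)]  # P(B and A_j) = P(B|A_j) P(A_j)
    p_b = sum(joint)                                      # total probability law (2.34)
    if p_b == 0:
        raise ValueError("P(B) = 0, so the conditional probabilities are not defined")
    return [j / p_b for j in joint]

doctor = posterior([0.3, 0.7], [0.6, 0.95])  # Example 2.9: we need P(A_2|B)
bridge = posterior([0.6, 0.4], [0.9, 0.2])   # Example 2.10: we need P(A_1|B)
print(round(doctor[1], 3), round(bridge[0], 3))  # 0.787 0.871
```

The posterior probabilities for a given B always add up to 1, which gives a quick check on the arithmetic.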




2.5.3 Entropy 


Another concept associated with a partitioning of a sample space or a system

$$(A_1, p_1), \ldots, (A_k, p_k),$$

where $A_1, \ldots, A_k$ are mutually exclusive and totally exhaustive events (a partitioning
of the sample space) and $p_1, \ldots, p_k$ the associated probabilities, that is, $P(A_j) = p_j$,
$j = 1, \ldots, k$ such that $p_j > 0$, $p_1 + \cdots + p_k = 1$, is the concept of entropy or information
or uncertainty. This can be explained with a simple example for k = 2.

Suppose that Mr Nimbus is contesting an election to be the chairman of the local
township. Suppose that the only two possibilities are that either he wins or he does
not win. Thus we have two events $A_1, A_2$ such that $A_1 \cap A_2 = \emptyset$, $A_1 \cup A_2 = S$ where S is
the sure event. Three local newspapers are predicting his chances of winning. The first
newspaper gave a 50-50 chance of his winning, the second gave an 80-20 chance and
the third gave a 60-40 chance. That is, if $A_1$ is the event of winning and $p = P(A_1)$
the true probability of winning, then the three estimates for this p are 0.5, 0.8, 0.6,
respectively. We have three schemes here:

$$\text{Scheme 1: } \begin{pmatrix} A_1 & A_2 \\ 0.5 & 0.5 \end{pmatrix}; \qquad \text{Scheme 2: } \begin{pmatrix} A_1 & A_2 \\ 0.8 & 0.2 \end{pmatrix}; \qquad \text{Scheme 3: } \begin{pmatrix} A_1 & A_2 \\ 0.6 & 0.4 \end{pmatrix}.$$

In Scheme 1, there is quite a lot of uncertainty about the win, because it is a 50-50
situation with maximum uncertainty, a 50% chance of winning. In Scheme 2, the
uncertainty is much less because it is an 80-20 situation. Whatever be that “uncertainty”,
one can say this much that in Scheme 3 the uncertainty is in between the situations in
Schemes 1 and 2. Lack of uncertainty is the “information” content in a scheme. If one
can come up with a mathematical measure for this “information” content in a scheme,
it has a lot of applications in practical situations such as sending a wireless message
from one point and capturing it at another point. One would like to make sure that the
message is fully captured in every respect or at least that the information content is
maximum. If a photo is transmitted, we would like to capture it with all of its details.

Shannon in 1948 came up with a measure of “uncertainty” or “information” in a
scheme. He developed it for communication networks. The measure is

$$S = -c \sum_{i=1}^{k} p_i \ln p_i$$

where c is a constant and ln is the natural logarithm. He developed it by putting forward
desirable properties as axioms, postulates or assumptions, and then deriving
the expression mathematically. A whole discipline has developed from it and it is now known
as “Information Theory” with a wide range of applications in almost all fields. The
measure is simply called “entropy” in order to avoid possible misinterpretation if the
term “information” or “uncertainty” is used. Such a procedure of developing a mathematical
measure from basic postulates is called an axiomatic development. Axiomatic
development of the basic concepts in Information Theory and Statistics may be seen
from the book [14]. Shannon’s theory is extended also in various forms, some of which
are given in the above mentioned book and some more applications may be seen from
a recent paper by Mathai and Haubold [8].
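Shannon's measure can be evaluated for the three schemes of Mr Nimbus. The Python sketch below is an illustration only; the choice c = 1 is an assumption made here, since the text leaves the constant c unspecified, and the function name `shannon_entropy` is likewise chosen for convenience.

```python
import math

def shannon_entropy(probs, c=1.0):
    """S = -c * sum of p_i ln p_i; a term with p_i = 0 is taken to contribute 0."""
    return -c * sum(p * math.log(p) for p in probs if p > 0)

schemes = {
    "Scheme 1": [0.5, 0.5],
    "Scheme 2": [0.8, 0.2],
    "Scheme 3": [0.6, 0.4],
}
entropies = {name: shannon_entropy(probs) for name, probs in schemes.items()}
for name, value in entropies.items():
    print(name, round(value, 4))
```

With c = 1 the 50-50 scheme gives ln 2, the largest value possible for k = 2, and the three values are ordered as Scheme 1 > Scheme 3 > Scheme 2, in agreement with the informal discussion of uncertainty above.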


Exercises 2.5 


2.5.1. An unbiased coin is tossed 10 times. What is the probability of getting (i) exactly 
4 heads?; (ii) exactly 2 tails?; (iii) first head at the 10th trial?; (iv) the 3rd head at the 
10th trial? 


2.5.2. A class has 10 students. Each student has a birthday which can be one of the 365
days of the year, and no other information about the birthdays is available. (i) A student
is selected at random. What is the probability that the student has the birthday
on 1 February? (ii) What is the probability that their birthdays are all distinct, none
coinciding with any other?


2.5.3. In a number lottery, each ticket has 3 digits. When the lottery is drawn, a specific
sequence of 3 digits will win; the digits could be repeated also. A person has bought 4
tickets. What is the probability that one of his tickets is the winning ticket?


2.5.4. In Exercise 2.5.3, if repetition of the numbers is not allowed, then what is the 
answer? 


2.5.5. From a well-shuffled deck of 52 playing cards, a hand of 8 is drawn at random. 
What is the probability that the hand contains 4 clubs, 2 spades, 1 heart and 1 dia- 
mond? 


2.5.6. In a game, an unbiased coin is tossed successively. The game is finished when a
head appears. What is the probability that (i) the game is over in 10 or fewer trials;
(ii) the game is over at the 10th trial?


2.5.7. In the same game of tossing an unbiased coin successively, suppose that a per- 
son wins the game if a head appears. What is the probability of the person winning 
the game? 


2.5.8. A balanced die is rolled twice. What is the probability of (i) rolling 6 (sum of
the face numbers is 6)? (ii) getting an even number on both occasions? (iii) an even
number coming in the first trial and an odd number coming in the second trial?


2.5.9. In a 6/36 lottery, there are 36 numbers and a given collection of 6 will win. A person
has 3 such 6/36 tickets. What is the probability that one of these three is the winning
ticket (assume that a person will not buy tickets with the same set of numbers on
more than one ticket)?


2.5.10. In a 7/49 lottery, there are 49 numbers and a specific collection of 7 numbers
wins. A person has 3 such tickets. (i) What is the probability that one of these
is the winning ticket (assume that no two tickets will have the same set of numbers)?
(ii) Comparing the probabilities in Exercises 2.5.9 and 2.5.10 (i), which lottery should
a person prefer, 6/36 or 7/49?


2.5.11. A manufacturing unit of water heaters is known to produce 10% defective
items. A customer bought 3 water heaters from this manufacturer. What is the
probability that (i) at least one of the three is defective; (ii) all are defective?


2.5.12. Vembanad Lake contains n Karimeen (particular fish). A random sample of 
50 Karimeen were caught and tagged and then released into the lake. After several 
months, a random sample of 100 Karimeen were caught. (i) What is the probability 
that this sample contains 5 tagged Karimeen? (ii) How will you estimate n, the total 
number of Karimeen in the Vembanad Lake based on this information that out of 100 
caught 5 were found to be tagged? 


2.5.13. A box contains 3 red and 5 green identical balls. Balls are taken at random, 
one by one, with replacement. What is the probability of getting (i) 3 red and 5 green 
in 8 trials; (ii) a red ball is obtained before a green ball is obtained. 


2.5.14. In Exercise 2.5.13, if the balls are taken at random without replacement, what
is the probability of getting (i) the sequence RRGG in four trials; (ii) RGRR in four trials;
(iii) the third ball is green given that the first two were red; (iv) the third ball is green
given that the first ball was red; (v) the third ball is green and no other information is
available?


2.5.15. Thekkady Wildlife Reserve is visited by people from Kerala, Tamilnadu, Karnataka
and from other places. For any day, suppose that the proportions are 50%,
30%, 10%, 10%, respectively. Suppose that the probability that garbage will be thrown
around at the reserve, on any day, by visitors from Kerala is 0.9, visitors from Tamilnadu
is 0.9, visitors from Karnataka is 0.5 and for others it is 0.10. (i) What is the
probability that the reserve will have garbage thrown around on any given day? (ii) On a
particular day, it was found that the place had garbage thrown around. What is
the probability that it was done by Keralite visitors?


2.5.16. In a production process, two machines are producing the same item. Machine
1 is known to produce 5% defective (items which do not satisfy quality specifications)
and Machine 2 is known to produce 2% defective. Sixty percent of the total production
per day is by Machine 1 and 40% by Machine 2. An item from the day's production is
taken at random and found to be defective. What is the probability that it was
produced by Machine 1?




2.5.17. In a multiple choice examination, there are 10 questions and each question is 
supplied with 4 possible answers of which one is the correct answer to the question. 
A student, who does not know any of the correct answers, is answering the questions 
by picking answers at random. What is the probability that the student gets (i) exactly 
8 correct answers; (ii) at least 8 correct answers; (iii) not more than three correct an- 
swers? 


2.5.18. There are 4 envelopes addressed to 4 different people. There are 4 letters ad- 
dressed to the same 4 people. A secretary puts the letters at random to the four en- 
velopes and mails. All letters are delivered. What is the probability that none gets the 
letter addressed to him/her? 


2.5.19. Construct two examples each of practical situations where you have two 
events A and B in the same sample space such that they are (i) mutually exclusive and 
independent; (ii) mutually exclusive and not independent; (iii) not mutually exclusive 
but independent; (iv) not mutually exclusive and not independent. 


2.5.20. For the events $A_1, \ldots, A_k$ in the same sample space S, show that
(i) $P(A_1 \cup A_2 \cup \cdots \cup A_k) \le P(A_1) + P(A_2) + \cdots + P(A_k)$;
(ii) $P(A_1 \cap A_2 \cap \cdots \cap A_k) \ge P(A_1) + \cdots + P(A_k) - (k - 1)$.


2.5.21. For two events A and B in the same sample space S, show that if A and B
are independent events (that is, satisfying the product probability property) then (i) A
and $B^c$; (ii) $A^c$ and B; (iii) $A^c$ and $B^c$ are independent events, where $A^c$ and $B^c$ denote
the complements of A and B in S.


3 Random variables 


3.1 Introduction 


Random variables constitute an extension of mathematical variables just like com- 
plex variables providing an extension to the real variable system. Random variables 
are mathematical variables with some probability measures attached to them. Before 
giving a formal definition to random variables, let us examine some random experi- 
ments and some variables associated with such random experiments. Let us take the 
simple experiment of an unbiased coin being tossed twice. 


Example 3.1. Tossing an unbiased coin twice. The sample space is

$$S = \{(H, T), (T, H), (H, H), (T, T)\}.$$

There are four outcomes or four elementary events. Let x be the number of heads in
the elementary events or in the outcomes. Then x can take the values 0, 1, 2, and thus
x is a variable here. But we can attach a probability statement to the values taken by
this variable x. The probability that x takes the value zero is the probability of getting
two tails and it is $\frac{1}{4}$. The probability that x takes the value 1 is the probability of getting
exactly one head, which is $\frac{1}{2}$. The probability that x takes the value 2 is $\frac{1}{4}$. The
probability that x takes any other value, other than 0, 1, 2, is zero because it is an impossible
event in this random experiment. Thus the probability function, associated with this
variable x, denoted by f(x), can be written as follows:

$$f(x) = \begin{cases} 0.25, & \text{for } x = 0 \\ 0.50, & \text{for } x = 1 \\ 0.25, & \text{for } x = 2 \\ 0, & \text{elsewhere.} \end{cases}$$

Here, x takes individually distinct values with non-zero probabilities. That is, x here
takes the specific value zero with probability $\frac{1}{4}$, the value 1 with probability $\frac{1}{2}$ and the
value 2 with probability $\frac{1}{4}$. Such random variables are called discrete random variables.
We will give a formal definition after giving a definition for a random variable.


We can also compute the following probability in this case. What is the probability
that $x \le a$ for all real values of a? Let us denote this probability by F(a), that is,

$$F(a) = \Pr\{x \le a\} = \text{probability of the event } \{x \le a\}.$$


Open Access. © 2018 Arak M. Mathai, Hans J. Haubold, published by De Gruyter. This work is licensed
under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
https://doi.org/10.1515/9783110562545-003

From Figure 3.1, it may be noted that when a is anywhere from $-\infty$ to 0, not including
zero, the probability is zero, and hence F(a) = 0. At x = 0, there is a probability $\frac{1}{4}$
and this remains the same for all values of a from zero to 1 with zero included but 1
excluded, that is, $0 \le a < 1$. Remember that we are computing the sum of all probabilities
up to and including the point x = a, or we are computing the cumulative probabilities in
the notation $\Pr\{x \le a\}$. There is a jump at x = 1 equal to $\frac{1}{2}$. Thus when $1 \le a < 2$,
all the probabilities cumulated up to a give $0 + \frac{1}{4} + 0 + \frac{1}{2} + 0 = \frac{3}{4}$. When a is anywhere in
$2 \le a < \infty$, all the probabilities cumulated up to a will be $0 + \frac{1}{4} + 0 + \frac{1}{2} + 0 + \frac{1}{4} + 0 = 1$.
Thus the cumulative probability function here, denoted by $F(a) = \Pr\{x \le a\}$, can be
written as follows:


$$F(a) = \begin{cases} 0, & -\infty < a < 0 \\ 0.25, & 0 \le a < 1 \\ 0.75, & 1 \le a < 2 \\ 1, & 2 \le a < \infty. \end{cases}$$

Here, for this variable x, we can associate with x a probability function f(x) and a
cumulative probability function $F(a) = \Pr\{x \le a\}$.
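Both functions can be generated directly from the sample space of Example 3.1. The Python sketch below is an illustration only; the helper names `pmf` and `cdf` are chosen here. Each of the four outcomes receives probability 1/4, the probabilities are collected by the value of the variable, and F(a) is obtained by adding up the probabilities of all values not exceeding a.

```python
from fractions import Fraction
from itertools import product

# Sample space of Example 3.1: two tosses of an unbiased coin, four equally likely outcomes.
sample_space = list(product("HT", repeat=2))

def pmf(variable):
    """Probability function of a variable defined on the sample space."""
    f = {}
    for outcome in sample_space:
        value = variable(outcome)
        f[value] = f.get(value, Fraction(0)) + Fraction(1, len(sample_space))
    return f

def cdf(f, a):
    """F(a) = Pr{x <= a}: the sum of the probabilities of all values not exceeding a."""
    return sum((p for value, p in f.items() if value <= a), Fraction(0))

f_x = pmf(lambda outcome: outcome.count("H"))  # x = number of heads
print(sorted(f_x.items()))
print([cdf(f_x, a) for a in (-1, 0, 0.5, 1, 1.9, 2, 5)])
```

Passing a different variable, for instance the number of heads minus the number of tails, `lambda outcome: outcome.count("H") - outcome.count("T")`, gives another probability function on the same sample space.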


Figure 3.1: Left: Probability function f(x); Right: Cumulative probability function F(x).


Notation 3.1. $\Pr\{c \le x \le d\}$: probability of the event that $c \le x \le d$.


Now let us examine another variable defined over this same sample space. Let y
be the number of heads minus the number of tails in the outcomes. Then y will take
the value -2 for the sample point (T, T) where the number of heads is zero and the
number of tails is 2. The points (H, T) and (T, H) will give a value 0 for y and (H, H)
gives a value 2 to y. If $f_y(y)$ denotes the probability function and $F_y(a) = \Pr\{y \le a\}$ the
cumulative probability function, then we have the following, which may also be noted
from Figure 3.2:

$$f_y(y) = \begin{cases} 0.25, & y = -2 \\ 0.5, & y = 0 \\ 0.25, & y = 2 \\ 0, & \text{elsewhere.} \end{cases}$$

$$F_y(a) = \begin{cases} 0, & -\infty < a < -2 \\ 0.25, & -2 \le a < 0 \\ 0.75, & 0 \le a < 2 \\ 1, & 2 \le a < \infty. \end{cases}$$




Both x and y here are discrete variables in the sense of taking individually distinct 
values with non-zero probabilities. We may also note one more property that on a 
given sample space any number of such variables can be defined. The above ones, 
x and y, are only two such variables. 


Figure 3.2: Left: Probability function of y, $f_y(y)$; Right: Cumulative probability function of y, $F_y(y)$.


Now, let us consider another example of a variable, which is not discrete. Let us ex- 
amine the problem of a child playing with scissors and cutting a string of 10 cm into 
two pieces. 


Example 3.2 (Random cut of a string). Let one end of the string be denoted by 0 and
the other end by 10 and let the distance from zero to the point of cut be x. Then, of
course, x is a variable because we do not know where exactly the cut on the string
would be. What is the probability that the cut is anywhere in the interval $2 \le x \le 3.5$?
In Chapter 1, we have seen that in a situation like this we assign probabilities
proportional to the length of the intervals and then

$$\Pr\{2 \le x \le 3.5\} = \frac{3.5 - 2.0}{10} = \frac{1.5}{10} = 0.15.$$

What is the probability that the cut is between 2 and 2.001? This is given by

$$\Pr\{2 \le x \le 2.001\} = \frac{2.001 - 2.000}{10} = \frac{0.001}{10} = 0.0001.$$

What is the probability that the cut is exactly at 2?

$$\Pr\{x = 2\} = \frac{2 - 2}{10} = 0.$$

Here, x is defined on a continuum of points and the probability that x takes any specific
value is zero because here the probabilities are assigned as relative lengths. A point
has no length. Such variables, which are defined on a continuum of points, will be called
continuous random variables. We will give a formal definition after defining a random
variable. A probability function which can be associated with this x, denoted by $f_x(x)$,
will be of the following form:

$$f_x(x) = \begin{cases} \frac{1}{10}, & 0 \le x \le 10 \\ 0, & \text{elsewhere.} \end{cases}$$


Let us see whether we can compute the cumulative probabilities here also. What is the
probability that $x \le a$ for all real values of a? Let us denote this by $F_x(a)$. Then when
$-\infty < a < 0$, the cumulative probability is zero. When $0 \le a < 10$, it is $\frac{a}{10}$, probabilities
being relative lengths, and when $10 \le a < \infty$ it is $\frac{10}{10} + 0 = 1$. Thus we have

$$F_x(a) = \begin{cases} 0, & -\infty < a < 0 \\ \frac{a}{10}, & 0 \le a < 10 \\ 1, & 10 \le a < \infty. \end{cases}$$


The probability function in the continuous case is usually called the density func- 
tion. Some authors do not make a distinction; in both discrete and continuous cases, 
the probability functions are either called probability functions or density functions. 
We will use the term “probability function” in the discrete case and mixed cases and 
“density function” in the continuous case. The density and cumulative density, for the 
above example, are given in Figure 3.3. 


Figure 3.3: Left: Density function of x; Right: Cumulative density function of x.


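Here, we may note some interesting properties. The cumulative probability function
$F_x(a)$ could have been obtained from the density function by integration. That is,

$$F_x(a) = \int_{-\infty}^{a} f_x(t)\,dt = 0 + \int_{0}^{a} \frac{1}{10}\,dt = \left[\frac{t}{10}\right]_{0}^{a} = \frac{a}{10}, \quad 0 \le a < 10.$$

Similarly, the density is available from the cumulative density function by differentiation
since here the cumulative density function is differentiable. That is,

$$\left[\frac{d}{da} F_x(a)\right]_{a = x} = \frac{1}{10} = f_x(x), \quad 0 < x < 10.$$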
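The two relations can also be checked numerically for the density of Example 3.2. The Python sketch below is an illustration only; the midpoint rule, the central difference, the step sizes and the lower limit -5 (standing in for $-\infty$, since the density is zero below 0) are choices made here and not part of the text.

```python
def density(x):
    """f_x(x) = 1/10 on [0, 10] and 0 elsewhere, as in Example 3.2."""
    return 0.1 if 0 <= x <= 10 else 0.0

def cdf_by_integration(a, lower=-5.0, steps=100_000):
    """Midpoint-rule approximation of the integral of the density from `lower` to a."""
    if a <= lower:
        return 0.0
    h = (a - lower) / steps
    return sum(density(lower + (i + 0.5) * h) for i in range(steps)) * h

def exact_cdf(a):
    """F_x(a): 0 below 0, a/10 on [0, 10), and 1 from 10 onwards."""
    return min(max(a / 10, 0.0), 1.0)

def density_by_differentiation(x, h=1e-4):
    """Central-difference approximation of the derivative of F_x at the point x."""
    return (exact_cdf(x + h) - exact_cdf(x - h)) / (2 * h)

print(round(cdf_by_integration(3.5), 3))          # close to 3.5/10
print(round(density_by_differentiation(2.0), 3))  # close to 1/10
```

At the points 0 and 10 the distribution function has corners, so the derivative does not exist there; this is why the differentiation statement is restricted to 0 < x < 10.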


We have considered two discrete variables associated with the random experiment
in Example 3.1 and one continuous random variable in Example 3.2. In all of the three
cases, one could have computed the cumulative probabilities, or $\Pr\{x \le a\}$ was defined
for all real a, $-\infty < a < \infty$. Such variables will be called random variables. Before giving
a formal definition, a few more observations are in order. In the two discrete cases,
we had the probability functions, which were of the form:

$$f_x(x_0) = \Pr\{x = x_0\} \tag{3.1}$$

and the cumulative probability function was obtained by adding up the individual
probabilities. That is,

$$F_x(a) = \Pr\{x \le a\} = \sum_{-\infty < x \le a} f_x(x). \tag{3.2}$$


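In Example 3.2, we considered one continuous random variable x where we had the
density function

$$f_x(x) = \begin{cases} \frac{1}{10}, & 0 \le x \le 10 \\ 0, & \text{elsewhere,} \end{cases}$$

and the cumulative density function

$$F_x(a) = \Pr\{x \le a\} = \begin{cases} 0, & -\infty < a < 0 \\ \frac{a}{10}, & 0 \le a < 10 \\ 1, & 10 \le a < \infty \end{cases} \;=\; \int_{-\infty}^{a} f_x(t)\,dt. \tag{3.3}$$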


Definition 3.1 (Random variables). Any variable x defined on a sample space S for
which the cumulative probabilities $\Pr\{x \le a\}$ can be defined for all real values of a,
$-\infty < a < \infty$, is called a real random variable x.


Definition 3.2 (Discrete random variables). Any random variable x which takes
individually distinct values with non-zero probabilities is called a discrete random
variable and in this case the probability function, denoted by $f_x(x)$, is given by

$$f_x(x_0) = \Pr\{x = x_0\}$$

and obtained by taking successive differences in (3.2).


Definition 3.3 (Continuous random variables). Any random variable x, which is 
defined on a continuum of points, where the probability that x takes a specific 
value x, is zero, is called a continuous random variable and the density function 
is available from the cumulative density by differentiation, when differentiable, or 
the cumulative density is available by integration of the density. That is, 


$$f_x(x) = \left[\frac{d}{da}F_x(a)\right]_{a=x} \tag{3.4}$$


or


$$F_x(a) = \int_{-\infty}^{a} f_x(t)\,dt. \tag{3.5}$$


Definition 3.4 (Distribution function). The cumulative probability/density func- 
tion of a random variable x is also called the distribution function associated with 
that random variable x, and it is denoted by F(x): 


$$F(a) = \Pr\{x \le a\}, \quad -\infty < a < \infty. \tag{3.6}$$


We can also define probability/density function and cumulative function, free of ran- 
dom experiments, by using a few axioms. 




3.2 Axioms for probability/density function and distribution functions


Definition 3.5 (Density/Probability function). Any function f (x) satisfying the fol- 
lowing two axioms is called the probability/density function of a real random vari- 
able x: 

(i) $f(x) \ge 0$ for all real $x$, $-\infty < x < \infty$;

(ii) $\int_{-\infty}^{\infty} f(x)\,dx = 1$ if x is continuous; and $\sum_{-\infty < x < \infty} f(x) = 1$ if x is discrete.


Example 3.3. Check whether the following can be probability functions for discrete 
random variables: 
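$$f_1(x) = \begin{cases} 2/3, & x = -2 \\ 1/3, & x = 5 \\ 0, & \text{elsewhere.} \end{cases}$$

$$f_2(x) = \begin{cases} 3/4, & x = -3 \\ 2/4, & x = 0 \\ -1/4, & x = 2 \\ 0, & \text{elsewhere.} \end{cases}$$

$$f_3(x) = \begin{cases} 3/5, & x = 0 \\ 3/5, & x = 1 \\ 0, & \text{elsewhere.} \end{cases}$$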


Solution 3.3. Consider $f_1(x)$. Here, $f_1(x)$ takes the non-zero values $\frac{2}{3}$ and $\frac{1}{3}$ at the points
$x = -2$ and $x = 5$, respectively, and x takes all other values with zero probabilities. Condition (i)
is satisfied, $f_1(x) \ge 0$ for all values of x. Condition (ii) is also satisfied because
$\frac{2}{3} + \frac{1}{3} + 0 = 1$. Hence $f_1(x)$ here can represent a probability function for a discrete
random variable x. We could have also stated $f_1(x)$ as follows:

$$f_1(-2) = \frac{2}{3}, \quad f_1(5) = \frac{1}{3}, \quad f_1(x) = 0 \text{ elsewhere}$$

where, for example, $f_1(-2)$ means $f_1(x)$ at $x = -2$.

$f_2(x)$ is such that $\sum_x f_2(x) = 1$, and thus the second condition is satisfied. But $f_2(x)$
at $x = 2$, or $f_2(2) = -\frac{1}{4}$, is negative, and hence condition (i) is violated. Hence $f_2(x)$
here cannot be the probability function of any random variable.

$f_3(x)$ is non-negative for all values of x because $f_3(x)$ takes the values $0, \frac{3}{5}, \frac{3}{5}$, but

$$\sum_x f_3(x) = 0 + \frac{3}{5} + \frac{3}{5} = \frac{6}{5} > 1.$$

Here, condition (ii) is violated, and hence $f_3(x)$ cannot be the probability function of
any random variable.
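
The checks in Solution 3.3 can also be carried out mechanically. The following is a minimal sketch, not part of the original text: the helper name `is_probability_function` and the dictionary representation (support point mapped to its value, zero elsewhere) are assumptions made only for illustration. Exact fractions are used so that the test "sums to 1" is not affected by rounding.

```python
from fractions import Fraction as Fr

def is_probability_function(pmf):
    """Check axioms (i) and (ii) for a discrete function given as {point: value}."""
    non_negative = all(value >= 0 for value in pmf.values())  # axiom (i)
    sums_to_one = sum(pmf.values()) == 1                      # axiom (ii)
    return non_negative and sums_to_one

# The three candidate functions of Example 3.3 (each is zero elsewhere).
f1 = {-2: Fr(2, 3), 5: Fr(1, 3)}
f2 = {-3: Fr(3, 4), 0: Fr(2, 4), 2: Fr(-1, 4)}
f3 = {0: Fr(3, 5), 1: Fr(3, 5)}

results = {name: is_probability_function(f)
           for name, f in [("f1", f1), ("f2", f2), ("f3", f3)]}
print(results)
```

Only `f1` passes both checks; `f2` fails axiom (i) and `f3` fails axiom (ii), in agreement with the solution above.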




Example 3.4. Check whether the following can be density functions of some random 
variables: 


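$$f_1(x) = \begin{cases} \frac{1}{b-a}, & a \le x \le b,\ b > a \\ 0, & \text{elsewhere.} \end{cases}$$

$$f_2(x) = \begin{cases} c\,x^4, & 0 \le x \le 1 \\ 0, & \text{elsewhere.} \end{cases}$$

$$f_3(x) = \begin{cases} \frac{1}{\theta}e^{-x/\theta}, & 0 \le x < \infty \\ 0, & \text{elsewhere.} \end{cases}$$

$$f_4(x) = \begin{cases} x, & 0 \le x < 1 \\ 2 - x, & 1 \le x \le 2 \\ 0, & \text{elsewhere.} \end{cases}$$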


Solution 3.4. $f_1(x)$ is non-negative since it is either 0 or $\frac{1}{b-a}$ where $b - a > 0$. Hence
condition (i) is satisfied. Now, check the second condition:

$$\int_{-\infty}^{\infty} f_1(x)\,dx = 0 + \int_a^b \frac{1}{b-a}\,dx = \left[\frac{x}{b-a}\right]_a^b = \frac{b-a}{b-a} = 1.$$

Hence the second condition is also satisfied. It is a density function of a continuous
random variable. The graph is given in Figure 3.4.


Figure 3.4: Uniform or rectangular density.


This density looks like a rectangle, and hence it is called a rectangular density. Since
the probabilities are available as integrals or areas under the curve, if we take any
interval of length $\varepsilon$ (epsilon) units, say from $d$ to $d + \varepsilon$, then the probability that x falls
in the interval $d$ to $d + \varepsilon$, or $d \le x \le d + \varepsilon$, is given by the integral:

$$\int_d^{d+\varepsilon} \frac{1}{b-a}\,dx = \frac{\varepsilon}{b-a}.$$

Since it is a rectangle, if we take an interval of length $\varepsilon$ anywhere in the interval
$a \le x \le b$, then the area will be the same, $\frac{\varepsilon}{b-a}$, or we can say that the total area 1 is
uniformly distributed over the interval $[a, b]$. In this sense, this density $f_1(x)$ is also
called uniform density. Also we may observe here that these unknown quantities a
and b could be any constants, free of x. As long as $b > a$, $f_1(x)$ is a density.




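$f_2(x) \ge 0$ for all values of x if $c > 0$, since it is either zero or $c\,x^4$ in the interval $[0,1]$,
which is non-negative. Thus condition (i) is satisfied if $c > 0$. Now, let us check condition (ii):

$$\int_{-\infty}^{\infty} f_2(x)\,dx = 0 + \int_0^1 c\,x^4\,dx = c\left[\frac{x^5}{5}\right]_0^1 = \frac{c}{5}.$$

Hence condition (ii) is satisfied if $c = 5$. For $c = 5$, $f_2(x)$ is a density function.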

$f_3(x)$ satisfies condition (i) when $\theta$ (theta) is positive because an exponential function
can never be negative. Hence $f_3(x)$ takes zero or a positive value only. Now let us
check the second condition:

$$\int_{-\infty}^{\infty} f_3(x)\,dx = 0 + \int_0^{\infty} \frac{1}{\theta}e^{-x/\theta}\,dx = \left[-e^{-x/\theta}\right]_0^{\infty} = 1.$$

Hence it is a density. Note that whatever be the value of $\theta$, as long as it is positive, $f_3(x)$ is
a density, see Figure 3.5.


Figure 3.5: Exponential or negative exponential density. 


Since this density is associated with an exponential function it is called an exponential
density. Note that if $\theta$ is negative, then $\frac{1}{\theta} < 0$ even though the exponential function
remains positive. Thus condition (i) will be violated. If $\theta$ is negative, then the exponent
$-\frac{x}{\theta} > 0$, so the integral of the exponential function from 0 to $\infty$ diverges. Thus
condition (ii) will also be violated. For $\theta < 0$ here, $f_3(x)$ cannot be a density. When
integration is from 0 to $\infty$, the exponential function with a positive exponent cannot
create a density. Hence we need not say "negative exponential density"; we simply say
that it is an exponential density, and it is implied that the exponent is negative.

$f_4(x)$ is zero or x in $[0,1)$ and $2 - x$ in $[1,2]$, and hence $f_4(x) \ge 0$ for all x and
condition (i) is satisfied. The total integral is available from the integrals over the several
intervals:

$$\int_{-\infty}^{\infty} f_4(x)\,dx = 0 + \int_0^1 x\,dx + \int_1^2 (2 - x)\,dx + 0 = \left[\frac{x^2}{2}\right]_0^1 + \left[2x - \frac{x^2}{2}\right]_1^2 = \frac{1}{2} + \frac{1}{2} = 1.$$

Thus, condition (ii) is also satisfied and $f_4(x)$ here is a density.
The graph of this density looks like a triangle, and hence it is called a triangular 
density as shown in Figure 3.6. 
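
Condition (ii) for the four densities of Example 3.4 can also be checked numerically. The sketch below is only an illustration under stated assumptions: the midpoint-rule helper `integrate` and the parameter values `a = 2`, `b = 7`, `theta = 5` are choices made here, not values from the text, and the exponential integral is truncated at `40 * theta`, where the remaining tail is negligible.

```python
import math

def integrate(f, lo, hi, n=100_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (k + 0.5) * h) for k in range(n))

a, b, theta = 2.0, 7.0, 5.0  # illustrative parameter values (assumptions)

uniform = integrate(lambda x: 1.0 / (b - a), a, b)
power = integrate(lambda x: 5.0 * x**4, 0.0, 1.0)        # c = 5
# The tail beyond 40*theta contributes about e^(-40), which is ignored here.
exponential = integrate(lambda x: math.exp(-x / theta) / theta, 0.0, 40.0 * theta)
triangular = integrate(lambda x: x, 0.0, 1.0) + integrate(lambda x: 2.0 - x, 1.0, 2.0)

for name, total in [("uniform", uniform), ("power", power),
                    ("exponential", exponential), ("triangular", triangular)]:
    print(f"{name:12s} total area = {total:.8f}")
```

Each total should come out numerically indistinguishable from 1, which is what the analytic integrals in Solution 3.4 give exactly.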


Figure 3.6: Triangular density.


Definition 3.6 (Parameters). Arbitrary constants sitting in a density or probability 
function are called parameters. 


In $f_1(x)$ of Example 3.4, there are two unknown quantities a and b. Irrespective of
the values of a and b, as long as $b > a$, we found that $f_1(x)$ was a density. Hence
there are two parameters in that density. In $f_3(x)$ of Example 3.4, we had one unknown
quantity $\theta$. As long as $\theta$ was positive, $f_3(x)$ remained a density. Hence there is one
parameter in this density, and that is $\theta > 0$.


Definition 3.7 (Normalizing constant). If a constant sitting in a function is such 
that for a specific value of this constant the function becomes a density or proba- 
bility function then that constant is called the normalizing constant. 


In $f_2(x)$ of Example 3.4, there was a constant c, and for $c = 5$, $f_2(x)$ became a density.
This c is the normalizing constant there.


Definition 3.8 (Degenerate random variable). If the whole probability mass is
concentrated at one point, then the random variable is called a degenerate random
variable or a mathematical variable. Consider the following density/probability
function:

$$f(x) = \begin{cases} 1, & x = b \\ 0, & \text{elsewhere.} \end{cases}$$

Here, at $x = b$ the whole probability mass 1 is there and everywhere else the function
is zero. The random variable here is called a degenerate random variable; that is, with
probability 1 the variable x takes the value b, or it is a mathematical variable. If there are
two points such that at $x = c$ we have probability 0.9999 and at $x = d \ne c$ we have
probability 0.0001, then it is not a degenerate random variable even though most of the
probability is at one point $x = c$.


Thus, statistics or statistical science is a systematic study of random phenomena 
and random variables, extending the study of mathematical variables, and as such 
mathematical variables become special cases of random variables or as degenerate 
random variables. This author had coined the name "Statistical Science" when he 
launched the Statistical Science Association of Canada, which became the present 
Statistical Society of Canada. Thus in this author's definition, statistical sciences has 
a wider coverage compared to mathematical sciences. But nowadays the term mathe- 
matical sciences is used to cover all aspects of mathematics and statistics. 




Example 3.5. Compute the distribution function for the following probability func- 
tions: 


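$$f_1(x) = \begin{cases} 0.3, & x = -2 \\ 0.2, & x = 0 \\ 0.5, & x = 3 \\ 0, & \text{otherwise;} \end{cases}$$

$$f_2(x) = \begin{cases} c\left(\frac{1}{2}\right)^x, & x = 0, 1, 2 \\ 0, & \text{otherwise.} \end{cases}$$
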
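Solution 3.5. The distribution function in the discrete case is

$$F(a) = \Pr\{x \le a\} = \sum_{-\infty < x \le a} f(x).$$

Hence for $f_1(x)$, it is zero for $-\infty < a < -2$, then there is a jump of 0.3 at $x = -2$, and so
on. Therefore,

$$F(a) = \begin{cases} 0, & -\infty < a < -2 \\ 0.3, & -2 \le a < 0 \\ 0.5\ (= 0.3 + 0.2), & 0 \le a < 3 \\ 1, & 3 \le a < \infty. \end{cases}$$

It is a step function. In general, for a discrete case we get a step function as the
distribution function.
For $f_2(x)$, the normalizing constant c is to be determined to make it a probability
function. If it is a probability function, then the total probability, which must equal 1, is

$$0 + \sum_{x=0}^{2} c\left(\frac{1}{2}\right)^x = c\left(1 + \frac{1}{2} + \frac{1}{4}\right) = \frac{7}{4}c.$$

Hence for $c = \frac{4}{7}$, $f_2(x)$ is a probability function and it is given by

$$f_2(x) = \begin{cases} 4/7, & x = 0 \\ 2/7, & x = 1 \\ 1/7, & x = 2 \\ 0, & \text{otherwise.} \end{cases}$$

$$F(x) = \begin{cases} 0, & -\infty < x < 0 \\ 4/7, & 0 \le x < 1 \\ 6/7, & 1 \le x < 2 \\ 1, & 2 \le x < \infty. \end{cases}$$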




Again, note that it is a step function. The student may draw the graphs for the distri- 
bution function for these two cases. 
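
The step-function construction of Solution 3.5 can be written out as a short program. This is a sketch under assumptions: the helper name `distribution_function` and the dictionary representation of a probability function are illustrative choices, not notation from the text. Exact fractions keep the jumps free of rounding error.

```python
from fractions import Fraction as Fr

def distribution_function(pmf):
    """Return F with F(a) = sum of pmf[x] over all support points x <= a."""
    points = sorted(pmf)
    def F(a):
        return sum(pmf[x] for x in points if x <= a)
    return F

# f1 of Example 3.5.
f1 = {-2: Fr(3, 10), 0: Fr(2, 10), 3: Fr(5, 10)}
F1 = distribution_function(f1)

# f2 of Example 3.5: c * (1/2)^x on x = 0, 1, 2, with c fixed by normalization.
c = 1 / sum(Fr(1, 2) ** x for x in range(3))
f2 = {x: c * Fr(1, 2) ** x for x in range(3)}
F2 = distribution_function(f2)

print("c =", c)
print([F1(a) for a in (-3, -2, 0, 2.9, 3)])
print([F2(a) for a in (-1, 0, 1, 2)])
```

Evaluating `F1` and `F2` just to the left of each support point and at the point itself shows the jumps, which is the step-function behaviour noted in the solution.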


Example 3.6. Evaluate the distribution function for the following densities: 


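$$f_1(x) = \begin{cases} \frac{1}{\theta}e^{-x/\theta}, & 0 \le x < \infty \\ 0, & \text{otherwise;} \end{cases}$$

$$f_2(x) = \begin{cases} x, & 0 \le x < 1 \\ 2 - x, & 1 \le x \le 2 \\ 0, & \text{otherwise.} \end{cases}$$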


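Solution 3.6. The distribution function, by definition, in the continuous case is

$$F(t) = \int_{-\infty}^{t} f(x)\,dx.$$

Hence in $f_1(x)$,

$$F(t) = 0 + \int_0^t \frac{1}{\theta}e^{-x/\theta}\,dx = \left[-e^{-x/\theta}\right]_0^t = 1 - e^{-t/\theta}, \quad 0 \le t < \infty,$$

and zero for $-\infty < t < 0$. For $f_2(x)$, one has to integrate in different pieces. Evidently,
$F(t) = 0$ for $-\infty < t < 0$. When t is in the interval 0 to 1, the function is x and its integral
is $\frac{x^2}{2}$. Therefore,

$$F(t) = \left[\frac{x^2}{2}\right]_0^t = \frac{t^2}{2}, \quad 0 \le t < 1.$$

When t is in the interval 1 to 2, the integral up to 1, available from $\frac{t^2}{2}$ at $t = 1$, which is $\frac{1}{2}$,
plus the integral of the function $(2 - x)$ from 1 to t is to be computed. That is,

$$\frac{1}{2} + \int_1^t (2 - x)\,dx = \frac{1}{2} + \left[2x - \frac{x^2}{2}\right]_1^t = -1 + 2t - \frac{t^2}{2}.$$

When t is above 2, the total integral is one. Hence we have

$$F(t) = \begin{cases} 0, & -\infty < t < 0 \\ \frac{t^2}{2}, & 0 \le t < 1 \\ -1 + 2t - \frac{t^2}{2}, & 1 \le t < 2 \\ 1, & t \ge 2. \end{cases}$$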


The student is asked to draw the graphs of the distribution function in these two den- 
sity functions. 
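
As a cross-check on the piecewise result for the triangular density, the closed-form distribution function can be compared with a direct numerical integration of the density. This is a minimal sketch; the function names and the midpoint rule (started at `lo = -1`, where the density is already zero) are assumptions chosen for illustration.

```python
def triangular_density(x):
    """f2 of Example 3.6: x on [0, 1), 2 - x on [1, 2], zero otherwise."""
    if 0 <= x < 1:
        return x
    if 1 <= x <= 2:
        return 2 - x
    return 0.0

def triangular_cdf(t):
    """Closed-form F(t) obtained piece by piece in Solution 3.6."""
    if t < 0:
        return 0.0
    if t < 1:
        return t * t / 2
    if t < 2:
        return -1 + 2 * t - t * t / 2
    return 1.0

def numeric_cdf(t, lo=-1.0, n=200_000):
    """Midpoint-rule integral of the density from lo (where it is zero) to t."""
    h = (t - lo) / n
    return h * sum(triangular_density(lo + (k + 0.5) * h) for k in range(n))

checks = [(t, triangular_cdf(t), numeric_cdf(t)) for t in (-0.5, 0.3, 1.0, 1.6, 2.5)]
for t, exact, approx in checks:
    print(f"t = {t:4.1f}  closed form = {exact:.6f}  numeric = {approx:.6f}")
```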




3.2.1 Axioms for a distribution function 


If we have a discrete or continuous random variable, the distribution function is
$F(t) = \Pr\{x \le t\}$. Without reference to a random variable x, one can define $F(t)$ by
using the following axioms:

(i) $F(-\infty) = 0$;

(ii) $F(\infty) = 1$;

(iii) $F(a) \le F(b)$ for all $a < b$;

(iv) $F(t)$ is right continuous.


Thus $F(t)$ is a monotonically non-decreasing (either it increases steadily or it remains
steady for some time) function from zero to 1 when t varies from $-\infty$ to $\infty$. The student
may verify that conditions (i) to (iv) above are satisfied by all the distribution functions
that we considered so far.


3.2.2 Mixed cases 


Sometimes we may have a random variable where part of the probability mass is
distributed on some individually distinct points (discrete case) but the remaining
probability is distributed over a continuum of points (continuous case). Such random
variables are called mixed cases. We will give one example here, from which it will be clear
how to handle such cases.


Example 3.7. Compute the distribution function for the following probability
function for a mixed case:

$$f(x) = \begin{cases} \frac{1}{2}, & x = -2 \\ x, & 0 \le x \le 1 \\ 0, & \text{otherwise.} \end{cases}$$


Solution 3.7. The definition for the distribution function remains the same whether
the variable is discrete, continuous or mixed:

$$F(t) = \Pr\{x \le t\}.$$

For $-\infty < t < -2$, obviously $F(t) = 0$. There is a jump of $\frac{1}{2}$ at $t = -2$ and then it remains
the same until 0. In the interval $[0,1]$, the function is x and its integral is

$$\int_0^t x\,dx = \left[\frac{x^2}{2}\right]_0^t = \frac{t^2}{2}.$$

For t greater than 1, the total probability 1 is attained. Therefore, we have

$$F(t) = \begin{cases} 0, & -\infty < t < -2 \\ \frac{1}{2}, & -2 \le t < 0 \\ \frac{1}{2} + \frac{t^2}{2}, & 0 \le t < 1 \\ 1, & t \ge 1. \end{cases}$$


The graph will look like that in Figure 3.7.


Figure 3.7: The distribution function for a mixed case.


Note that for t up to 0 it is a step function, then the remaining part is a continuous
curve until 1 and then it remains steady at the final value 1.
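
The behaviour just described (a jump at the point mass, then a continuous rise) can be inspected directly from the closed form of Solution 3.7. The sketch below is illustrative only; the function name `mixed_cdf` and the small offset `eps` used to look just to the left of a point are assumptions made here.

```python
def mixed_cdf(t):
    """Closed-form F(t) for the mixed case of Example 3.7."""
    if t < -2:
        return 0.0
    if t < 0:
        return 0.5                  # point mass 1/2 at x = -2
    if t < 1:
        return 0.5 + t * t / 2      # continuous part: integral of x from 0 to t
    return 1.0

eps = 1e-9
jump_at_minus_2 = mixed_cdf(-2) - mixed_cdf(-2 - eps)   # size of the discrete jump
jump_at_0 = mixed_cdf(0) - mixed_cdf(-eps)              # no jump where the density starts

grid = [-3 + 0.01 * k for k in range(501)]              # points from -3 to 2
values = [mixed_cdf(t) for t in grid]
is_non_decreasing = all(u <= v for u, v in zip(values, values[1:]))

print(jump_at_minus_2, jump_at_0, is_non_decreasing)
```

The jump of size 1/2 at t = -2 is the discrete part; everywhere else the function is continuous, and it is non-decreasing from 0 to 1 as required by the axioms of Section 3.2.1.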


Example 3.8. Compute the probabilities (i) $\Pr\{-2 \le x \le 1\}$, (ii) $\Pr\{0 \le x \le 1.7\}$ for the
probability function

$$f(x) = \begin{cases} 0.2, & x = -1, \\ 0.3, & x = 0, \\ 0.3, & x = 1.5, \\ 0.2, & x = 2, \\ 0, & \text{otherwise.} \end{cases}$$


Solution 3.8. In the discrete case, the probabilities are added up from those at
individual points. When $-2 \le x \le 1$, the probabilities in this interval are 0, 0.2 at $x = -1$ and
0.3 at $x = 0$. Therefore, the answer to (i) is $0 + 0.2 + 0.3 = 0.5$. When $0 \le x \le 1.7$, the
probabilities are 0, 0.3 at $x = 0$ and 0.3 at $x = 1.5$. Hence the answer to (ii) is $0 + 0.3 + 0.3 = 0.6$.


In the discrete case, the probability that x falls in a certain interval is the sum of 
the probabilities from the corresponding distinct points with non-zero probabilities 
falling in that interval. 


Example 3.9. Compute the following probabilities on the waiting time t, (i) $\Pr\{0 \le t \le 2\}$,
(ii) $\Pr\{3 \le t \le 10\}$ if the waiting time has an exponential density with the
parameter $\theta = 5$.


Solution 3.9. The waiting time having an exponential density with parameter $\theta = 5$
means that the density of t is given by

$$f(t) = \begin{cases} \frac{1}{5}e^{-t/5}, & 0 \le t < \infty \\ 0, & \text{elsewhere.} \end{cases}$$




Probabilities are the areas under the density curve between the corresponding
ordinates or the integral of the density over the given interval. Hence for (i) the probability
is given by

$$\int_0^2 \frac{1}{5}e^{-t/5}\,dt = \left[-e^{-t/5}\right]_0^2 = 1 - e^{-2/5}.$$

In a similar manner, the probability for (ii) is given by

$$\int_3^{10} \frac{1}{5}e^{-t/5}\,dt = \left[-e^{-t/5}\right]_3^{10} = e^{-3/5} - e^{-2}.$$


The shaded areas in Figure 3.8 are the probabilities.


Figure 3.8: Probabilities in the exponential density.


In a continuous case, the probability of the variable x falling in a certain interval 
[a, b] is the area under the density curve over the interval [a, b] or between the or- 
dinates at x = a and x = b. 
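
The two rules stated above (sum over points in the discrete case, area under the density in the continuous case) can be sketched in code for Examples 3.8 and 3.9. The helper names `interval_probability` and `exponential_cdf` are hypothetical, introduced here only for illustration; the continuous probabilities use the closed form $F(t) = 1 - e^{-t/\theta}$ derived in Solution 3.6.

```python
import math

def interval_probability(pmf, lo, hi):
    """Discrete case: add the probabilities of the support points in [lo, hi]."""
    return sum(p for x, p in pmf.items() if lo <= x <= hi)

def exponential_cdf(t, theta):
    """F(t) = 1 - exp(-t/theta) for t >= 0 and zero otherwise."""
    return 1.0 - math.exp(-t / theta) if t >= 0 else 0.0

# Example 3.8 (discrete).
pmf = {-1: 0.2, 0: 0.3, 1.5: 0.3, 2: 0.2}
p_discrete_i = interval_probability(pmf, -2, 1)
p_discrete_ii = interval_probability(pmf, 0, 1.7)

# Example 3.9 (continuous, theta = 5): Pr{a <= t <= b} = F(b) - F(a).
theta = 5.0
p_cont_i = exponential_cdf(2, theta) - exponential_cdf(0, theta)
p_cont_ii = exponential_cdf(10, theta) - exponential_cdf(3, theta)

print(p_discrete_i, p_discrete_ii)
print(p_cont_i, p_cont_ii)
```

Numerically, the two continuous probabilities are about 0.33 and 0.41, matching $1 - e^{-2/5}$ and $e^{-3/5} - e^{-2}$.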


Exercises 3.2 


3.2.1. Check whether the following are probability functions for some discrete ran- 
dom variables: 


3, x=-l 2, X= 3 
Πί)Ξ413, x-1 βᾳὸ-ἠ1, x=} 

0Ο, elsewhere; 0Ο, elsewhere. 

12, χ-ο 0.8, x=1 
fi) = 1-0. x=1 fif0-103, x=2 

0, elsewhere; 0, otherwise. 


3.2.2. Check whether the following are density functions for some continuous ran- 
dom variables: 


cx? 3x41, O<x<2 
Πο) - . 
0, otherwise; 




mod 1<x< œ 


O, otherwise; 


βία) - ce P, Loo «x «oo; 
oe, 0<x<2 
fip0-16-x 2<x<6 
0, otherwise. 


3.2.3. An unbiased coin is tossed several times. If x denotes the number of heads in 
the outcomes, construct the probability function of x when the coin is tossed (i) once; 
(ii) two times; (iii) five times. 


3.2.4. In a multiple choice examination, there are 8 questions and each question is 
supplied with 3 possible answers of which one is the correct answer to the question. 
A student, who does not know any of the correct answers, is answering the questions 
by picking the answers at random. Let x be the number of correct answers. Construct 
the probability function of x. 


3.2.5. In Exercise 3.2.4, let x be the number of trials (answering the questions) at 
which the first correct answer is obtained, such as the third (x = 3) question answered 
is the first correct answer. Construct the probability function of x. 


3.2.6. In Exercise 3.2.4, let the x-th trial resulted in the 3rd correct answer. Construct 
the probability function of x. 


3.2.7. Compute the distribution function for each probability function in Exercise 3.2.1 
and draw the corresponding graphs. 


3.2.8. Compute the distribution function for each probability function in Exercise 3.2.2
and draw the corresponding graphs also.


3.2.9. Compute the distribution functions and draw the graphs in Exercises 3.2.3- 
3.2.6. 


3.2.10. For the following mixed case, compute the distribution function: 


1 
2» X= -5 
X, 0<x<1, 
Γ0)- 41, 
2» ΧΞ 5 
O, otherwise. 


3.2.11. In Exercise 3.2.2, compute the following probabilities: (i) $\Pr\{1 < x \le 1.5\}$ for
$f_1(x)$; (ii) $\Pr\{2 \le x < 5\}$ for $f_2(x)$; (iii) $\Pr\{-2 < x < 2\}$ for $f_3(x)$; (iv) $\Pr\{1.5 < x < 3\}$ for $f_4(x)$.


3.2.12. In Exercises 3.2.4 and 3.2.5, compute the probability for $2 < x < 5$, and in
Exercise 3.2.6 compute the probability for $4 \le x \le 7$.




Note 3.1. For a full discussion of statistical densities and probability functions in
common use, we need some standard series such as binomial series, logarithmic series,
exponential series, etc. We will mention these briefly here. Those who are familiar with
these may skip this section and go directly to the next chapter.


3.3 Some commonly used series 


The following power series can be obtained by using the following procedure when
the function is differentiable. Let $f(x)$ be differentiable a countably infinite number of
times and let it admit a power series expansion

$$f(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n + \cdots$$

then the coefficient

$$a_n = \frac{1}{n!}\left[\frac{d^n}{dx^n}f(x)\right]_{x=0}$$

or the series is

$$f(x) = f(0) + \frac{f^{(1)}(0)}{1!}x + \frac{f^{(2)}(0)}{2!}x^2 + \cdots \tag{3.7}$$

where $f^{(r)}(0)$ means to differentiate $f(x)$ r times and then evaluate at $x = 0$. All of the
following series are derived by using the same procedure.
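
The coefficient rule can be tried out on a case where the answer is known in advance. The sketch below applies it to a polynomial stored as a coefficient list, differentiating n times, evaluating at zero and dividing by n!. The representation and the helper names `differentiate` and `maclaurin_coefficient` are assumptions made for this illustration; it is not a general symbolic differentiator.

```python
from math import factorial

def differentiate(coeffs):
    """Derivative of a polynomial given as [a0, a1, a2, ...]."""
    return [k * c for k, c in enumerate(coeffs)][1:]

def maclaurin_coefficient(coeffs, n):
    """a_n = (n-th derivative evaluated at x = 0) / n!."""
    derivative = list(coeffs)
    for _ in range(n):
        derivative = differentiate(derivative)
    value_at_zero = derivative[0] if derivative else 0
    return value_at_zero / factorial(n)

cube = [1, 3, 3, 1]   # (1 + x)^3 = 1 + 3x + 3x^2 + x^3
recovered = [maclaurin_coefficient(cube, n) for n in range(5)]
print(recovered)
```

The procedure returns the original coefficients 1, 3, 3, 1, followed by zero for every higher power, as it should for a polynomial of degree three.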


3.3.1 Exponential series 


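$$e^{x} = 1 + \frac{x}{1!} + \frac{x^2}{2!} + \cdots + \frac{x^r}{r!} + \cdots \quad \text{for all } x, \tag{3.8}$$

$$e^{-x} = 1 - \frac{x}{1!} + \frac{x^2}{2!} - \cdots + (-1)^r\frac{x^r}{r!} + \cdots \quad \text{for all } x. \tag{3.9}$$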


3.3.2 Logarithmic series 


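Logarithm to the base e is called the natural logarithm and it is denoted by ln.

$$\ln(1 + x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \cdots \quad \text{for } |x| < 1. \tag{3.10}$$

For the convergence of the series, we need the condition $|x| < 1$:

$$\ln(1 - x) = -\left[x + \frac{x^2}{2} + \frac{x^3}{3} + \cdots\right], \quad |x| < 1. \tag{3.11}$$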




3.3.3 Binomial series 


The students are familiar with the binomial expansions for positive integer values, 
which can also be obtained by direct repeated multiplications, and the general result 
can be established by the method of induction: 


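$$(1 + x)^2 = 1 + 2x + x^2; \qquad (a + b)^2 = a^2 + 2ab + b^2;$$

$$(1 + x)^3 = 1 + 3x + 3x^2 + x^3; \qquad (a + b)^3 = a^3 + 3a^2 b + 3ab^2 + b^3;$$

$$(1 + x)^n = \binom{n}{0} + \binom{n}{1}x + \cdots + \binom{n}{n}x^n, \quad n = 1, 2, \ldots; \tag{3.12}$$

$$(a + b)^n = \binom{n}{0}a^n b^0 + \binom{n}{1}a^{n-1}b + \cdots + \binom{n}{n}a^0 b^n, \quad n = 1, 2, \ldots.$$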


What happens if the exponent is not a positive integer, if the exponent is something 
like 1 -20,- 7 or some general rational number a (alpha)? We can derive an expansion 
by using (3.7). Various forms of these are given below: 


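$$(1 - x)^{-\alpha} = 1 + \frac{(\alpha)_1}{1!}x + \frac{(\alpha)_2}{2!}x^2 + \cdots + \frac{(\alpha)_r}{r!}x^r + \cdots, \quad |x| < 1. \tag{3.13}$$

If $\alpha$ is not a negative integer, then we need the condition $|x| < 1$ for the convergence of
the series. The Pochhammer symbol is

$$(\alpha)_r = \alpha(\alpha + 1)\cdots(\alpha + r - 1), \quad \alpha \ne 0, \quad (\alpha)_0 = 1. \tag{3.14}$$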


Various forms of (3.13) can be obtained by replacing $x$ by $-x$ and $\alpha$ by $-\alpha$. For the sake
of completeness, these will be listed here for ready reference:

$$(1 + x)^{-\alpha} = [1 - (-x)]^{-\alpha} = 1 - \frac{(\alpha)_1}{1!}x + \frac{(\alpha)_2}{2!}x^2 - \cdots, \quad |x| < 1. \tag{3.15}$$

$$(1 - x)^{\alpha} = (1 - x)^{-(-\alpha)} = 1 + \frac{(-\alpha)_1}{1!}x + \frac{(-\alpha)_2}{2!}x^2 + \cdots \quad \text{for } |x| < 1. \tag{3.16}$$

$$(1 + x)^{\alpha} = [1 - (-x)]^{-(-\alpha)} = 1 - \frac{(-\alpha)_1}{1!}x + \frac{(-\alpha)_2}{2!}x^2 - \cdots, \tag{3.17}$$

for $|x| < 1$. In all cases, the condition $|x| < 1$ is needed for the convergence of the series
except in the case when the exponent is a positive integer. When the exponent is $\alpha > 0$,
the coefficient of $x^r$ contains the factor $(-\alpha)_r$. If $\alpha$ is a positive integer, then this
Pochhammer symbol will be zero for some r and the series will terminate into a polynomial, and hence
the question of convergence does not arise. We have used the form $(1 \pm x)^{\pm\alpha}$. This is
general enough because if we have a form

$$(a \pm b)^{\pm\alpha} = a^{\pm\alpha}\left(1 \pm \frac{b}{a}\right)^{\pm\alpha},$$

then we can convert to the form $(1 \pm x)$ by taking out a or b to make the resulting
series convergent.
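
A partial sum of the series (3.13) can be compared with a direct evaluation of $(1 - x)^{-\alpha}$. The sketch below is illustrative: the function names, the number of terms, and the test values `x = 0.3`, `alpha = 2.5` are assumptions made here. Each term is built from the previous one using $(\alpha)_{r+1} = (\alpha)_r(\alpha + r)$, which follows from the definition (3.14).

```python
def pochhammer(alpha, r):
    """(alpha)_r = alpha (alpha + 1) ... (alpha + r - 1), with (alpha)_0 = 1."""
    result = 1.0
    for k in range(r):
        result *= alpha + k
    return result

def binomial_series(x, alpha, terms=200):
    """Partial sum of (1 - x)^(-alpha) = sum over r of (alpha)_r x^r / r!."""
    total, term = 0.0, 1.0            # term holds (alpha)_r x^r / r!, starting at r = 0
    for r in range(terms):
        total += term
        term *= (alpha + r) * x / (r + 1)
    return total

approx = binomial_series(0.3, 2.5)
exact = (1 - 0.3) ** (-2.5)

# With alpha = -3 the exponent is the positive integer 3: the series terminates,
# so it is valid even outside |x| < 1 (here x = 2, and (1 - 2)^3 = -1).
terminating = binomial_series(2.0, -3)

print(approx, exact, terminating, pochhammer(-3, 4))
```

The last printed value shows the Pochhammer symbol $(-3)_4$ vanishing, which is why the series for a positive integer exponent reduces to a polynomial.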


3.3.4 Trigonometric series 


$$\sin x = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \cdots$$

$$\cos x = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \cdots$$

$$e^{ix} = \cos x + i\sin x, \quad i = \sqrt{-1}.$$


3.3.5 A note on logarithms


The mathematical statement

$$a^x = b$$


can be stated as the exponent x is the logarithm of b to the base a. For example, $2^3 = 8$
can be written as 3 (the exponent) is the logarithm of 8 to the base 2. The definition
is restricted to b being strictly a positive quantity when real; the logarithm of negative
quantities or zero is not defined in the real case. The standard notations used are the
following:

$\log b = \log_{10} b$ or common logarithm or logarithm to the base 10. When we say
"log y", it is a logarithm of y to the base 10.

$\ln b = \log_e b$ or natural logarithm or logarithm to the base e. When we say "ln y",
it is a logarithm of y to the base e.

For all other bases, other than 10 or e, write the base and write it as $\log_a b$. This
note is given here because the students usually do not know the distinction between
the notations "log" and "ln". For example,


$$\frac{d}{dx}\ln x = \frac{1}{x}, \qquad \frac{d}{dx}\log x = \frac{1}{x}\log_{10} e \ne \frac{1}{x}.$$


Note 3.2. In Section 3.2.1, we have given an axiomatic definition of a distribution func- 
tion and we defined a random variable with the help of the distribution function. Let 
us denote the distribution function associated with a random variable x by F(x). If 
F(x) is differentiable at an arbitrary point x, then let us denote the derivative by f(x). 
That is, $\frac{d}{dx}F(x) = f(x)$, which will also indicate that

$$F(x) = \int_{-\infty}^{x} f(t)\,dt.$$



In this situation, we call F(x) an absolutely continuous distribution function. Abso- 
lute continuity is related to more general measures and integrals known as Lebesgue 
integrals. For the time being, if you come across the phrase “absolutely continuous 
distribution function", then assume that F(x) is differentiable and its derivative is the 
density f (x). 


Note 3.3. Suppose that a density function $f(x)$ has non-zero part over the interval
$[a, b]$ and zero outside this interval. When x is continuous, the probability that
$x = a$, that is, $\Pr\{x = a\} = 0$, and $\Pr\{x = b\} = 0$. Then the students have the confusion
whether $f(x)$ should be written as non-zero in $a \le x \le b$ or $a < x \le b$ or $a \le x < b$ or
$a < x < b$. Should we include the boundary points $x = a$ and $x = b$ with the non-zero
part of the density or with the zero part? For example, if we write an exponential density:

$$f(x) = \begin{cases} \frac{1}{\theta}e^{-x/\theta}, & \theta > 0,\ 0 \le x < \infty \\ 0, & \text{elsewhere} \end{cases}$$

should we write $0 < x < \infty$ or $0 \le x < \infty$? Note that if we are computing only
probabilities then it will not make any difference. But if we are looking for a mode, then the
function has a mode at $x = 0$, and if $x = 0$ is not included in the non-zero part of the
density, then naturally we cannot evaluate the mode. For estimation of the parameters
also, we may have similar problems. For example, if we consider a uniform density

$$f(x) = \begin{cases} \frac{1}{b-a}, & a \le x \le b \\ 0, & \text{elsewhere} \end{cases}$$

then what is known as maximum likelihood estimates [discussed in Module 7] for the
parameters a and b do not exist if the end points are not included. That is, if the
non-zero part of the density is written as $a < x < b$, then the maximum likelihood estimates
for a and b do not exist. Hence when writing the non-zero part of the density include
the end points of the interval where the function is non-zero.


Note 3.4. Note that when a random variable x is continuous, then the following
probability statements are equivalent:

$$\Pr\{a \le x \le b\} = \Pr\{a < x \le b\} = \Pr\{a \le x < b\} = \Pr\{a < x < b\} = F(b) - F(a)$$

where $F(x)$ is the distribution function. Also when $F(x)$ is absolutely continuous

$$F(b) - F(a) = \int_a^b f(t)\,dt \quad \text{or} \quad \frac{d}{dx}F(x) = f(x)$$

where $f(x)$ is the density function.




Exercises 3.3 


3.3.1. By using a binomial expansion show that, for n - 1,2, ... 


ο 6) 
(3-40) 


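3.3.2. By using the identity,

$$(1 + x)^m (1 + x)^n = (1 + x)^{m+n}$$

and comparing the coefficient of $x^r$ on both sides show that

$$\sum_{s=0}^{r} \binom{m}{s}\binom{n}{r-s} = \binom{m+n}{r}, \quad m, n = 1, 2, \ldots.$$


3.3.3. By using the identity,

$$(1 + x)^{n_1}(1 + x)^{n_2}\cdots(1 + x)^{n_k} = (1 + x)^{n_1 + \cdots + n_k}$$

and comparing the coefficient of $x^r$ on both sides show that

$$\sum_{r_1}\cdots\sum_{r_k} \binom{n_1}{r_1}\binom{n_2}{r_2}\cdots\binom{n_k}{r_k} = \binom{n}{r}$$

where $r = r_1 + \cdots + r_k$, $n = n_1 + \cdots + n_k$, $n_j = 1, 2, \ldots$, $j = 1, \ldots, k$.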
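3.3.4. Show that

$$\sum_{m=1}^{n} m = \frac{n(n+1)}{2}; \qquad \sum_{m=1}^{n} m^2 = \frac{n(n+1)(2n+1)}{6}; \qquad \sum_{m=1}^{n} m^3 = \left[\frac{n(n+1)}{2}\right]^2.$$


3.3.5. Compute the sums $\sum_{m=1}^{n} m^4$; $\sum_{m=1}^{n} m^5$; $\sum_{m=1}^{n} m^p$, $p = 6, 7, \ldots$.

3.3.6. Show that

$$a + ar + ar^2 + \cdots + ar^{n-1} = \frac{a(1 - r^n)}{1 - r}, \quad r \ne 1;$$

$$a + ar + ar^2 + \cdots = a\sum_{n=0}^{\infty} r^n = \frac{a}{1 - r}, \quad \text{for } |r| < 1.$$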


3.3.7. What is the infinite sum in Exercise 3.3.6 for (i) r = 1; (ii) r = —1; (iii) r > 1; 
(iv) r < -1. 




3.3.8. Evaluate the sum Σο (E-1)p*q*, q 21- p, O «p «1. 
3.3.9. Evaluate the sum ΣΤ , (")p*e%q"*, q=1-p,0<p<1. 


3.3.10. Compute the sum $X, (C1)p*e*q**, q21-p, 0 «p«1. 


4 Expected values 


4.1 Introduction 


Here, we will start with some commonly used probability functions and density
functions and then we will define the concepts called expected values, moments,
moment generating functions, etc. In Chapter 3, we have defined a probability function,
a density function and distribution function or cumulative probability/density
function. There is another term used in statistics and probability literature called
"distributions", something like "exponential distribution", "normal distribution", etc. There
is a possibility of confusion between a distribution and a distribution function. A
distribution function is the cumulative probability/density function as defined in
Chapter 3, whereas when we say that, for example, we have an exponential distribution or
a variable x is exponentially distributed, we mean that we have identified a random
variable, a density function or a distribution function, and it is the random variable
having exponential density. When we say we have a uniform distribution, it means
that we have identified a random variable having a uniform or rectangular density or
we have identified a random variable that is uniformly distributed. It is unfortunate
that two technical terms, "distribution" and "distribution function", which are very
similar, are used in statistical literature. Hopefully, the students will get accustomed
to the technical terms fast and will not be confused.

We will introduce the concept of expected values first and then we will deal with 
commonly appearing probability/density functions or commonly appearing statistical 
distributions. 


4.2 Expected values 


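Notation 4.1. $E(\cdot) = E[(\cdot)]$ = expected value of $(\cdot)$.


Definition 4.1. Expected value of a function $\psi(x)$ of the random variable x: Let x
be a random variable and let $\psi(x)$ ($\psi$ is the Greek letter psi) be a function of x. Then
the expected value of $\psi(x)$, denoted by $E[\psi(x)] = E(\psi(x))$, is defined as

$$E[\psi(x)] = E(\psi(x)) = \begin{cases} \sum_{-\infty < x < \infty} \psi(x)f(x) & \text{when } x \text{ is discrete} \\ \int_{-\infty}^{\infty} \psi(x)f(x)\,dx & \text{when } x \text{ is continuous} \end{cases} \tag{4.1}$$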


where f(x) is the probability/density function. 


Open Access. © 2018 Arak M. Mathai, Hans J. Haubold, published by De Gruyter. This work is licensed
under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
https://doi.org/10.1515/9783110562545-004




When $f(x)$ is a density function, obtained by differentiating a distribution function
$F(x)$, then instead of writing $f(x)\,dx$ we may also write it as $dF(x)$ in (4.1). Expected
values need not exist for all functions $\psi(x)$ and for all random variables x.


Example 4.1. Consider the probability function: 


$$f(x) = \begin{cases} 0.2, & x = -1 \\ 0.3, & x = 0 \\ 0.5, & x = 2 \\ 0, & \text{elsewhere.} \end{cases}$$

Evaluate (i) $E[x]$; (ii) $E[2x^2 - 5x]$; (iii) $E[8]$.


Solution 4.1. In (i), the function $\psi(x) = x$ and by definition, we multiply x with the
probability function and sum up over all values of x. Here, the random variable takes
only $-1, 0, 2$ with non-zero probabilities and all other values with zero probabilities.
Hence when we multiply with the probability function, everywhere it will be zero
except at the points $x = -1$, $x = 0$ and $x = 2$. Therefore, for (i),

$$E[x] = \sum_{-\infty < x < \infty} x f(x) = 0 + (-1)(0.2) + (0)(0.3) + (2)(0.5) = 0.8.$$

When x takes the value $-1$, the corresponding probability is 0.2. Hence $x f(x) =
(-1)(0.2) = -0.2$ at $x = -1$. Similarly, $x f(x) = (0)(0.3) = 0$ at $x = 0$ and $x f(x) = (2)(0.5) =
1.0$ at $x = 2$. Then we added up all these to get our final answer as 0.8.

(ii) Here, we need the expected value of $2x^2 - 5x$. That is,

$$E[2x^2 - 5x] = \sum_{-\infty < x < \infty} [2x^2 - 5x]f(x) = \sum_{x = -1, 0, 2} [2x^2 - 5x]f(x).$$

When $x = -1$, the probability $f(x) = 0.2$, and hence the corresponding value of

$$[2x^2 - 5x]f(x) = [2(-1)^2 - 5(-1)](0.2) = 2(-1)^2(0.2) - 5(-1)(0.2) = 1.4.$$

When $x = 0$, $2x^2 - 5x = 0$, and hence this term will be zero. When $x = 2$,

$$[2x^2 - 5x]f(x) = [2(2)^2 - 5(2)](0.5) = -1.$$

Hence the final sum $(1.4) + (0) + (-1) = 0.4$ is the answer to (ii). We may also note one
interesting property that the above expected value can also be computed as

$$E[2x^2 - 5x] = 2E[x^2] - 5E[x].$$

In (iii), we do not have any variable x. Hence from the definition

$$E[8] = \sum_{-\infty < x < \infty} 8f(x) = 8\sum_{-\infty < x < \infty} f(x) = 8$$

since the total probability or $\sum_x f(x) = 1$ by definition. This in fact is a general result.
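
The three computations of Example 4.1 follow directly from definition (4.1) in the discrete case and can be reproduced in a few lines. In this sketch, the helper name `expected_value` and the dictionary form of the probability function are illustrative assumptions. Exact fractions are used so that the linearity check holds exactly rather than up to rounding.

```python
from fractions import Fraction as Fr

def expected_value(psi, pmf):
    """E[psi(x)] = sum over the support of psi(x) * f(x)  (discrete case of (4.1))."""
    return sum(psi(x) * p for x, p in pmf.items())

# Probability function of Example 4.1 (zero elsewhere).
f = {-1: Fr(2, 10), 0: Fr(3, 10), 2: Fr(5, 10)}

mean = expected_value(lambda x: x, f)                      # part (i)
quadratic = expected_value(lambda x: 2 * x**2 - 5 * x, f)  # part (ii)
constant = expected_value(lambda x: 8, f)                  # part (iii)

# The same value as (ii), obtained as 2 E[x^2] - 5 E[x].
by_linearity = 2 * expected_value(lambda x: x**2, f) - 5 * mean

print(mean, quadratic, constant, by_linearity)
```

The values are 4/5, 2/5 and 8, that is, 0.8, 0.4 and 8 as in the solution, and the last quantity coincides with the second.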




Result 4.1. Whatever be the random variable under consideration, the expected 
value of a constant is the constant itself. That is, 


$$E[c] = c \tag{4.2}$$
whenever c is a constant. 


One more property is evident from the computations in Example 4.1. When a sum 
was involved in (ii), we could have obtained the final answer by summing up individ- 
ually also. The following general result follows from the definition itself. 


Result 4.2. The expected value of a constant times a function is the constant times 
the expected value of the function, and the expected value of a sum is the sum of the 
expected values whenever the expected values exist. That is, 


$$E[c\,\psi(x)] = c\,E[\psi(x)] \tag{4.3}$$

$$E[a\psi_1(x) + b\psi_2(x)] = E[a\psi_1(x)] + E[b\psi_2(x)] = aE[\psi_1(x)] + bE[\psi_2(x)] \tag{4.4}$$


where a and b are constants and $\psi_1(x)$ and $\psi_2(x)$ are two functions of the same
random variable x.


Note that once the expected value is taken the resulting quantity is a constant and 
it does not depend on the random variable x anymore. 


Result 4.3. For any random variable x, for which $E(x)$ exists,

$$E[x - E(x)] = 0. \tag{4.5}$$


The proof follows from the fact that $E(x)$ is a constant and the expected value of a
constant is the constant itself. Taking expectation by using Result 4.2, we have

$$E[x - E(x)] = E[x] - E[E(x)] = E(x) - E(x) = 0.$$


Example 4.2. If a discrete random variable takes the values $x_1, \ldots, x_n$ with
probabilities $p_1, \ldots, p_n$, respectively, compute $E[x]$ for the general case as well as for the
particular case when $p_1 = p_2 = \cdots = p_n = \frac{1}{n}$.

Solution 4.2. We may note that a general discrete random variable is of the general
type described in this example. The variable takes some values with non-zero
probabilities and other values with zero probabilities. In Example 4.1, the variable took the
values $-1, 0, 2$ with non-zero probabilities and other values with zero probabilities. If
we draw a correspondence with Example 4.2, then $x_1 = -1$, $x_2 = 0$, $x_3 = 2$, $n = 3$. Let us
consider the special case first. In the special case, all the probabilities are equal to $\frac{1}{n}$
each. Hence

$$E[x] = \sum_{-\infty < x < \infty} x f(x) = 0 + x_1\left(\frac{1}{n}\right) + x_2\left(\frac{1}{n}\right) + \cdots + x_n\left(\frac{1}{n}\right) = \frac{x_1 + \cdots + x_n}{n} = \bar{x}. \tag{4.6}$$

In this case, it is the average of the numbers $x_1, \ldots, x_n$, or $E[x]$ in this case corresponds
to the concept of average. In the general case,

$$E[x] = \sum_{-\infty < x < \infty} x f(x) = 0 + x_1 p_1 + \cdots + x_n p_n = \frac{\sum_{i=1}^{n} x_i p_i}{\sum_{i=1}^{n} p_i} \tag{4.7}$$

and note that $\sum_{i=1}^{n} p_i = 1$ by definition. The last expression in (4.7) is the expression for
the centre of gravity of a physical system when $p_1, \ldots, p_n$ are weights, or forces acting
at the points $x_1, \ldots, x_n$. Hence $E[x]$ can be interpreted in many ways.


$E[x]$ is the mean value of x or some sort of an average value of x and it is the centre
of gravity of a physical system.


Example 4.3. Evaluate the mean value of x if x has the following density: 


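$$f(x) = \begin{cases} \frac{1}{x^2}, & 1 \le x < \infty \\ 0, & \text{elsewhere.} \end{cases}$$


Solution 4.3. Since it is a continuous case, we integrate to find the expected values.
Therefore,

$$E[x] = \int_{-\infty}^{\infty} x f(x)\,dx = 0 + \int_1^{\infty} x\left(\frac{1}{x^2}\right)dx = \int_1^{\infty} \frac{1}{x}\,dx = [\ln x]_1^{\infty} = \infty.$$

Here, the mean value is not a finite quantity. When an expected value is not a definite
finite quantity, we say that the expected value does not exist.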


The expected value of $\psi(x)$ does not exist when:

(i) $E[\psi(x)] = +\infty$

(ii) $E[\psi(x)] = -\infty$

(iii) $E[\psi(x)]$ oscillates between finite or infinite quantities, or the sum (in the
discrete case) or the integral (in the continuous case) does not converge.


Example 4.4. If $x$ is uniformly distributed over the interval $[a, b]$, compute (i) the
mean value of $x$; (ii) $E[x - E(x)]^2$.




Solution 4.4. Since we are told that $x$ is uniformly distributed, the density of $x$ is a
rectangular or uniform density. That is,

\[ f(x) = \begin{cases} \dfrac{1}{b-a}, & a \le x \le b,\ b > a \\ 0, & \text{elsewhere.} \end{cases} \]

Hence the mean value of $x$ is, by definition,

\[ E[x] = \int_{-\infty}^{\infty} x f(x)\,dx = \int_a^b \frac{x}{b-a}\,dx = \Big[\frac{x^2}{2(b-a)}\Big]_a^b = \frac{b^2 - a^2}{2(b-a)} = \frac{(b-a)(b+a)}{2(b-a)} = \frac{a+b}{2}. \]

Thus, the mean value is the middle point of the interval here, or the average of the end
points, as shown in Figure 4.1.

Figure 4.1: Expected value in the uniform density.


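(ii) For computing this, either we can substitute for $E(x)$ and then compute directly, or
simplify first and then compute. Let us simplify first by using Results 4.1 and 4.2:

\[ E[x - E(x)]^2 = E[x^2 - 2xE(x) + (E(x))^2] = E\Big[x^2 - 2x\Big(\frac{a+b}{2}\Big) + \Big(\frac{a+b}{2}\Big)^2\Big] \]
\[ = E(x^2) - 2\Big(\frac{a+b}{2}\Big)E(x) + \Big(\frac{a+b}{2}\Big)^2 \]

since the expected value of a constant is the constant itself, and

\[ = E(x^2) - 2\Big(\frac{a+b}{2}\Big)^2 + \Big(\frac{a+b}{2}\Big)^2 = E(x^2) - \frac{(a+b)^2}{4}. \]

This means that we only have to compute $E(x^2)$. But

\[ E(x^2) = \int_a^b \frac{x^2}{b-a}\,dx = \Big[\frac{x^3}{3(b-a)}\Big]_a^b = \frac{b^3 - a^3}{3(b-a)} = \frac{a^2 + ab + b^2}{3}. \]

Hence

\[ E[x - E(x)]^2 = \frac{a^2 + ab + b^2}{3} - \frac{(a+b)^2}{4} = \frac{(b-a)^2}{12}. \]

This quantity $E[x - E(x)]^2$ is a very important quantity and it is called the variance.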
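
As a numerical cross-check of Solution 4.4, the following sketch approximates $E(x)$ and $E(x^2)$ for a uniform density by the midpoint rule and compares them with the closed forms $(a+b)/2$ and $(b-a)^2/12$. The helper name `uniform_moment`, the step count and the end points $a = 2$, $b = 5$ are assumptions chosen only for illustration.

```python
def uniform_moment(a, b, r, steps=100_000):
    """Approximate E[x^r] for the uniform density 1/(b-a) on [a, b] (midpoint rule)."""
    h = (b - a) / steps            # width of each sub-interval
    density = 1.0 / (b - a)        # constant height of the uniform density
    total = 0.0
    for k in range(steps):
        mid = a + (k + 0.5) * h    # midpoint of the k-th sub-interval
        total += (mid ** r) * density * h
    return total

a, b = 2.0, 5.0                    # hypothetical end points
mean = uniform_moment(a, b, 1)
variance = uniform_moment(a, b, 2) - mean ** 2

print(mean, (a + b) / 2)           # numerical mean next to (a+b)/2
print(variance, (b - a) ** 2 / 12) # numerical variance next to (b-a)^2/12
```

The two printed pairs should agree to several decimal places; the small remaining difference is the discretization and rounding error of the numerical integration.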




Notation 4.2. $\mathrm{Var}(x) = \sigma^2 = \sigma_x^2$ = variance of $x$, where $\sigma$ is the Greek letter sigma.


Definition 4.2 (Variance of a random variable). It is defined as the following expected
value for any random variable $x$:

\[ \mathrm{Var}(x) = \sigma_x^2 = E[x - E(x)]^2 = E(x^2) - [E(x)]^2. \tag{4.8} \]

Since $\mathrm{Var}(x) \ge 0$, we have from (4.8), $E(x^2) \ge [E(x)]^2 \Rightarrow [E(x^2)]^{1/2} \ge |E(x)|$. The last part
of (4.8) is obtained by expanding and then simplifying by using Results 4.1 and 4.2
and by using the fact that $E(x)$ is a constant. It is already derived in the solution of
Example 4.4.


Notation 4.3. $\sigma$, $\sigma_x$, $\sqrt{\mathrm{Var}(x)}$: Standard deviation of $x$.


Definition 4.3 (Standard deviation). The standard deviation of a real random variable
$x$ is defined as the positive square root of the variance of $x$. It is abbreviated
as S.D. That is,

\[ \text{S.D.} = \sigma_x = +\sqrt{\mathrm{Var}(x)}. \tag{4.9} \]


What are the uses of the variance and standard deviation? We have already seen that 
the mean value = E(x) has the interpretation of average, central value, central ten- 
dency of x, centre of gravity of a physical system. Similarly, the variance corresponds 
to the moment of inertia in a physical system. Standard deviation is a mathematical 
distance of x from the point E(x) or from the centre of gravity of the system, and hence 
the standard deviation can be taken as a measure of scatter or dispersion of x from the 
mean value = E(x). A high value for the standard deviation means that the variable is 
more spread out and small value for standard deviation means that the variable is con- 
centrated around the centre of gravity of the system or around the central value = E(x). 


4.2.1 Measures of central tendency 


One measure of central tendency of the random variable x is the mean value or ex- 
pected value of x, E(x). Other measures in common use are the median and the mode. 
The idea of a median value is that we are looking for a point M such that the probability
of the random variable x falling below M is equal to the probability of x falling above M.
In this sense, M is a middle point. M may not be unique and there may be several 
points qualifying to be medians for a given x. The following is a formal definition for 
the median. 


Notation 4.4. M: median of x. 




Definition 4.4 (Median of a random variable x). Let $M$ be a point such that

\[ \Pr\{x \le M\} \ge \frac{1}{2} \quad\text{and}\quad \Pr\{x \ge M\} \ge \frac{1}{2}. \]

All points $M$ satisfying these two conditions are called median points.


In some cases, we can have a unique point M and in other cases we can have 
several values for M. We will examine some discrete and some continuous cases. 


Example 4.5. Consider the following probability functions:

\[ f_1(x) = \begin{cases} 0.4, & x = 1 \\ 0.2, & x = 2 \\ 0.4, & x = 7 \\ 0, & \text{elsewhere;} \end{cases} \qquad f_2(x) = \begin{cases} 0.5, & x = -1 \\ 0.5, & x = 5. \end{cases} \]

Compute the medians in each case.

Figure 4.2: Left: Probability function $f_1(x)$; Right: Probability function $f_2(x)$.


Solution 4.5. In $f_1(x)$ of Figure 4.2, the point $x = 2$ satisfies both the conditions:

\[ \Pr\{x \le 2\} = 0.6 > 0.5 \quad\text{and}\quad \Pr\{x \ge 2\} = 0.6 > 0.5 \]

and hence $x = 2$ is the unique median point for this $x$.
In $f_2(x)$, any point from $-1$ to 5 will qualify, or $-1 \le M \le 5$. If a unique value is
preferred, then one can take the middle value of this interval, namely, $x = 2$, as the median.
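
The two conditions of Definition 4.4 can be checked mechanically. The sketch below tests a grid of candidate points against the probability functions of Example 4.5; the function name `is_median` and the candidate grid (half-integers from $-2$ to 8) are assumptions made for illustration, and the probabilities are written as exact fractions to avoid rounding issues in the comparisons.

```python
from fractions import Fraction

def is_median(m, pmf):
    """Check Definition 4.4: Pr{x <= m} >= 1/2 and Pr{x >= m} >= 1/2."""
    half = Fraction(1, 2)
    below = sum(p for x, p in pmf.items() if x <= m)
    above = sum(p for x, p in pmf.items() if x >= m)
    return below >= half and above >= half

# Probability functions of Example 4.5 (0.4 = 2/5, 0.2 = 1/5, 0.5 = 1/2).
f1 = {1: Fraction(2, 5), 2: Fraction(1, 5), 7: Fraction(2, 5)}
f2 = {-1: Fraction(1, 2), 5: Fraction(1, 2)}

# Candidate points -2, -1.5, ..., 8 (an arbitrary grid for the illustration).
candidates = [Fraction(k, 2) for k in range(-4, 17)]

medians_f1 = [m for m in candidates if is_median(m, f1)]
medians_f2 = [m for m in candidates if is_median(m, f2)]

print([float(m) for m in medians_f1])                    # only 2.0 qualifies
print(float(min(medians_f2)), float(max(medians_f2)))    # -1.0 5.0
```

On this grid only $x = 2$ qualifies for $f_1(x)$, while every grid point in $[-1, 5]$ qualifies for $f_2(x)$, in line with Solution 4.5.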


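Example 4.6. Compute the median point or points for the following densities:

\[ f_1(x) = \begin{cases} \dfrac{x}{2}, & 0 \le x \le 2 \\ 0, & \text{elsewhere;} \end{cases} \qquad f_2(x) = \begin{cases} x, & 0 \le x \le 1 \\ 3 - x, & 2 \le x \le 3 \\ 0, & \text{otherwise.} \end{cases} \]

Figure 4.3: Left: Density $f_1(x)$; Right: Density $f_2(x)$.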


Solution 4.6. For $f_1(x)$ of Figure 4.3, we can get a unique median point. Let us compute
the probability

\[ \Pr\{x \le M\} = \int_0^M \frac{x}{2}\,dx = \frac{M^2}{4}. \]

We can equate this to $\frac{1}{2}$ because in this case

\[ \Pr\{x \le M\} = \frac{1}{2} = \Pr\{x \ge M\}. \]

Equating $\frac{M^2}{4}$ to $\frac{1}{2}$, we have $M = \sqrt{2}$.

In the case of $f_2(x)$, let us compute the probabilities in the two pieces where we
have non-zero functions:

\[ \int_0^1 x\,dx = \frac{1}{2} \quad\text{and}\quad \int_2^3 (3 - x)\,dx = \frac{1}{2}. \]

This means that any point from 1 to 2, $1 \le M \le 2$, will qualify to be a median point. If a
single point is preferred, then 1.5 is the candidate.


Another measure used as a measure of central tendency of a random variable x is 
the mode. This is grouped with measures of central tendency but it does not have much 
to do with measuring central tendency. The point(s) corresponding to local maximum 
(maxima) for the density curve (in the continuous case) is (are) taken as mode(s) in 
the continuous case, and the point(s) corresponding to the local maximum probability 
mass is (are) taken as mode(s) in the discrete case. In Example 4.5, we would have
taken 1 and 7 as modes in $f_1(x)$, and $-1$ and 5 as modes in $f_2(x)$. In Example 4.6, we
would have taken $x = 2$ as the mode in $f_1(x)$ and 1 and 2 as modes in $f_2(x)$.




4.2.2 Measures of scatter or dispersion 


We have already introduced one measure of scatter, which is the standard deviation.
This is a measure of the spread of $x$ from $E(x)$. If an arbitrary point $a$ is taken, then
$\sqrt{E[x - a]^2}$ is a measure of scatter of $x$ from the point $a$, which is also called the mean
deviation of $x$ from $a$. Other measures of scatter from $a$ are the following:

\[ M_r(a) = \big[E|x - a|^r\big]^{\frac{1}{r}}, \quad r = 1, 2, \ldots \tag{4.10} \]
\[ M_1(a) = E|x - a| = \text{mean absolute deviation from } a. \tag{4.11} \]

All these qualify to be mathematical distances of $x$ from the point $a$ and in this sense
are measures of scatter of $x$ from the point $a$. Out of these, the mean deviation from
$a$ and the mean absolute deviation from $a$ are very important because they are very often
used for statistical decision making. Hence we will list two basic properties here.


Result 4.4. The mean deviation from $a$ is least when $a = E(x)$.


Proof. Minimization of $\sqrt{E(x - a)^2}$ also implies the minimization of the square
$E[x - a]^2$ and vice versa. Consider $E[x - a]^2$, add and subtract $E(x)$ inside and
expand to obtain the following:

\[ E[x - a]^2 = E[x - E(x) + E(x) - a]^2 \]
\[ = E[x - E(x)]^2 + 2[E(x) - a]\,E[x - E(x)] + [E(x) - a]^2 \]
\[ = \mathrm{Var}(x) + [E(x) - a]^2 \]

since $E(x) - a$ is a constant and $E[x - E(x)] = 0$. But the right-hand side has only one
term depending on $a$ and both terms are non-negative. Hence the minimum is attained
when the term containing $a$ is zero, that is, $[E(x) - a] = 0 \Rightarrow a = E(x)$.
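
The decomposition used in the proof, $E[x - a]^2 = \mathrm{Var}(x) + [E(x) - a]^2$, can be verified exactly for a small discrete distribution. The sketch below uses the probability function $f_1(x)$ of Example 4.5 with exact fractions; the helper name `mean_square_about` and the grid of points $a$ are assumptions introduced only for this check.

```python
from fractions import Fraction

def mean_square_about(a, pmf):
    """Return E[(x - a)^2] for a discrete random variable with probability function pmf."""
    return sum(p * (x - a) ** 2 for x, p in pmf.items())

# Probability function f1 of Example 4.5, written with exact fractions.
pmf = {1: Fraction(2, 5), 2: Fraction(1, 5), 7: Fraction(2, 5)}

mean = sum(p * x for x, p in pmf.items())   # E(x) = 18/5
variance = mean_square_about(mean, pmf)     # E[x - E(x)]^2

# Check E[x - a]^2 = Var(x) + (E(x) - a)^2 on a grid a = 0, 0.5, ..., 8.
grid = [Fraction(k, 2) for k in range(0, 17)]
identity_holds = all(
    mean_square_about(a, pmf) == variance + (mean - a) ** 2 for a in grid
)
smallest = min(mean_square_about(a, pmf) for a in grid + [mean])

print(mean, variance, identity_holds)
print(smallest == variance)   # the minimum over the grid is attained at a = E(x)
```

Because the second term $(E(x) - a)^2$ is never negative, no point of the grid can give a smaller value than $a = E(x)$.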


This result is very important in model building problems, in regression analysis, 
etc. 


Result 4.5. The mean absolute deviation is least when the deviations are taken from
the median, or $E|x - a|$ is least when $a$ = the median of $x$.


The proof is slightly longer and it will be listed in the exercises. 


4.2.3 Some properties of variance 


(i) $\mathrm{Var}(c) = 0$ when $c$ is a constant.
The proof follows from the definition itself:

\[ \mathrm{Var}(c) = E[c - E(c)]^2 = E[c - c]^2 = 0 \tag{4.12} \]

since $E(c) = c$ when $c$ is a constant.


(ii) Let y = x + a where a is a constant. Then 
Var(y) = Var(x + a) = Var(x) 


or the relocation of the variable x does not affect the variance. 
When a constant is added to the variable, we say that the variable is relocated and 
a here is the location parameter. Observe that E(x + a) = E(x) + a and then 


y - E(y) = (x +a) - (E(x) + a) = x - E(x) 


thus $a$ is canceled. As an example, let us consider the following problem. One pumpkin
is randomly selected from a heap of pumpkins and weighed. Let $x_1$ be the weight
observed. Later it was noticed that the balance was defective and always
showed 100 grams more than the actual weight. If $x$ is the true weight of the pumpkin,
then the observed weight is $x_1 = x + 100$, weight being measured in grams. Will this
faulty balance affect the mean value and variance of the true weight of a pumpkin
selected at random? Let $y = x + c$ where $x$ is a random variable and $c$ is a constant.
Then, denoting the mean value by $\mu$ (mu) and variance by $\sigma^2$ (sigma squared), we
have

\[ \mu_y = E(y) = E(x + c) = E(x) + c. \tag{4.13} \]

The mean value is affected. For the variance,

\[ \sigma_y^2 = \mathrm{Var}(y) = E[y - E(y)]^2 = E[(x + c) - (E(x) + c)]^2 = E[x - E(x)]^2 = \mathrm{Var}(x). \tag{4.14} \]

Note that when the deviation $y - E(y)$ is taken the constant is eliminated, and hence the
relocation of the variable will affect the mean value but it will not affect the variance.

The pumpkin was weighed in grams. Suppose we want to convert the weight into
kilograms. Then

\[ x \text{ grams} \Rightarrow z = \frac{x}{1000} \text{ kilograms.} \]

Let us compare the variance of $x$ and the variance of $z$. In general, let $z = bx$ where $b$ is a
constant and let us denote the variances of $z$ and $x$ by $\sigma_z^2$ and $\sigma_x^2$, respectively, and the
corresponding standard deviations by $\sigma_z$ and $\sigma_x$. Then by definition

\[ \sigma_z^2 = E[z - E(z)]^2 \quad\text{where } E(z) = E(bx) = bE(x) \]
\[ = E[bx - bE(x)]^2 = b^2 E[x - E(x)]^2 = b^2\,\mathrm{Var}(x) = b^2\sigma_x^2. \tag{4.15} \]




Thus the variance of $z$ is the square of the scaling factor times the variance of $x$. Since
the standard deviation is the positive square root of the variance,

\[ \sigma_z = |b|\sigma_x. \tag{4.16} \]

Remember to take the absolute value. If the original $b$ is $-2$, then the multiplying factor
for the standard deviation is $|-2| = 2$ and not $-2$. The term "scaling" came from
multiplication by a positive constant but nowadays it is used to denote a constant multiple
whether the constant is positive or negative, as long as it is not zero. Now consider a
scaling and relocation together, that is, let $t = cx + d$ where $c$ and $d$ are constants. Then

\[ t = cx + d \Rightarrow \mathrm{Var}(t) = c^2\,\mathrm{Var}(x); \quad \sigma_t = |c|\sigma_x. \tag{4.17} \]
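
The effect of relocation and scaling in (4.13)-(4.17) can be checked on a small discrete example. In the sketch below the pumpkin weights and their probabilities are hypothetical numbers chosen for illustration, and the helper names `mean`, `variance` and `transform` are assumptions, not notation from the text.

```python
from fractions import Fraction

def mean(pmf):
    """E(x) for a discrete probability function given as {value: probability}."""
    return sum(p * x for x, p in pmf.items())

def variance(pmf):
    """Var(x) = E[x - E(x)]^2."""
    m = mean(pmf)
    return sum(p * (x - m) ** 2 for x, p in pmf.items())

def transform(pmf, c, d):
    """Probability function of t = c*x + d (c must be non-zero so values stay distinct)."""
    return {c * x + d: p for x, p in pmf.items()}

# Hypothetical true weights in grams with their probabilities.
weights = {900: Fraction(1, 4), 1000: Fraction(1, 2), 1100: Fraction(1, 4)}

shifted = transform(weights, 1, 100)                 # faulty balance: x + 100
in_kilograms = transform(weights, Fraction(1, 1000), 0)
negated = transform(weights, -2, 7)                  # t = -2x + 7

print(mean(weights), mean(shifted))                  # 1000 1100
print(variance(weights), variance(shifted))          # 5000 5000
print(variance(in_kilograms))                        # 1/200
print(variance(negated))                             # 20000
```

The shift changes the mean but not the variance, while a scale factor $c$ multiplies the variance by $c^2$, so the standard deviation is multiplied by $|c|$ even when $c$ is negative.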


Some general properties may be observed. The implications in (iii) and (iv) fail in general,
although they may hold in some special cases:

(iii) $\sigma_x^2 = E[x - E(x)]^2$ does not imply $\sigma_x = E[x - E(x)]$;
(iv) $\sigma_x^2 = E[x - E(x)]^2$ does not imply $\sigma_x = E[|x - E(x)|]$;

(v) $\sigma_x^2 = 0$ if and only if $x$ is a degenerate random variable. That is, if $x$ is degenerate
then $\sigma_x^2 = 0$ and if $\sigma_x^2 = 0$ then $x$ is degenerate. That means the only random variable
whose variance is zero is the degenerate random variable. "If and only if" is usually
abbreviated as "iff".

(vi) $\sigma_x^2 > 0$, $\sigma_x > 0$ always except for the degenerate case where $\sigma_x^2 = 0$. That is, the
variance, and thereby the standard deviation, of a random variable is non-negative.


Exercises 4.2 


4.2.1. Compute (i) $E(x)$; (ii) $E[x^2 - 2x + 5]$; (iii) $E[x - 2]^2$; (iv) $\mathrm{Var}(x)$ for the random
variable in

\[ f(x) = \begin{cases} 0.5, & x = -1 \\ 0.5, & x = 1 \\ 0, & \text{elsewhere.} \end{cases} \]


4.2.2. Compute the same items in Exercise 4.2.1 for the random variable in 


fü |; O<x<2 


O, elsewhere. 


Determine the normalizing constant c first. 




4.2.3. Compute the same items in Exercise 4.2.1 for the random variable in

\[ f(x) = \begin{cases} cx, & 0 \le x \le 1 \\ 2 - x, & 1 \le x \le 2 \\ 0, & \text{elsewhere.} \end{cases} \]

Evaluate the normalizing constant $c$ first.
4.2.4. Construct two examples where E(x) = 0, Var(x) = 1 in a discrete case.
4.2.5. Construct two examples where E(x) = 0, Var(x) = 1 in a continuous case.


4.2.6. Compute $E[x - 2]^2$ for Exercises 4.2.1, 4.2.2, 4.2.3, first directly by taking
the expected value of $(x - 2)^2$ and then by expanding $(x - 2)^2$, taking the expected
values and simplifying. Verify that both procedures give the same results.


4.2.7. Construct a probability/density function of x where E[|x|] = 0. 
4.2.8. Let 


oe(l-x)*, Oxx«1 
fœ) = 
0, elsewhere. 


be a density. Compute (i) the normalizing constant c; (ii) the median point; (iii) mode 
or modes. 


4.2.9. For Exercise 4.2.3, compute (i) c; (ii) the median; (iii) the mode or modes. 


4.2.10. Construct a density function of a random variable x where E(x) exists whereas 
E[x?] does not exist. 


4.2.11. In Exercises 4.2.1-4.2.3, compute the following probabilities: (i) $\Pr\{-2 < x \le 0.8\}$;
(ii) $\Pr\{0.5 < x < 1.8\}$; (iii) mark the areas in a graph in (i) and (ii) for
Exercises 4.2.2 and 4.2.3.


4.2.12. The monthly income x of households in a city is found to follow the distribu- 
tion with the density: 


fto- x» 1000xx«x30000 
O, elsewhere. 


Compute (i) c; (ii) the expected income or mean value of the income; (iii) the median
income in this city so that 50% of the households have income below that and 50%
above that.


4.2.13. The waiting time t at a bus stop is found to be exponentially distributed with 
the density 




1.-£ 

ze 5, Oxt«oo 
f( 2i? l 

0, otherwise, 


time being measured in minutes. Compute (i) the expected waiting time $\mu$ (mu) for this
bus; (ii) the standard deviation $\sigma$ for this waiting time; (iii) the probability
$\Pr\{\mu - \sigma < t < \mu + \sigma\}$, or the waiting time being within one standard deviation of the mean value;
(iv) $\Pr\{|t - \mu| \le 2\sigma\}$; (v) $\Pr\{|t - \mu| \le 3\sigma\}$.


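4.2.14. For the exponential density

\[ f(x) = \begin{cases} \dfrac{1}{\theta} e^{-\frac{x}{\theta}}, & 0 \le x < \infty \\ 0, & \text{elsewhere} \end{cases} \]

prove that the mean value of $x$, that is, $E[x]$, is this parameter $\theta$. Compute the variance
also. [Observe that $\theta$ is a scaling parameter here.]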


4.2.15. The life time t of a certain type of electric bulbs is found to be exponentially 
distributed with expected life time 1000 hours, time being measured in hours. One 
bulb from this type of bulbs has already lasted for 1500 hours (i.e., t > 1500). Let A 
be this event. Let B be the event that it will last for at least another 500 hours more. 
Interpret the events $A$, $A \cap B$ and then compute the probability that it will last at least
another 500 hours more given that it has already lasted for 1500 hours. 


4.3 Higher moments 


In Section 4.2, we defined the expected value in general as well as the mean value $= E(x)$ and
the variance $= E[x - E(x)]^2$. Here, we define certain expected values which are called the
moments. The concept of moments originally came from physics but is used heavily in
statistical literature.


Notation 4.5. $\mu_r'$: $r$-th integer moment about the origin.


Definition 4.5 (Moments about the origin). The $\alpha$-th moment, or the $\alpha$-th moment
about the origin, of a random variable $x$ is defined as

\[ E[x^{\alpha}] = \int_{-\infty}^{\infty} x^{\alpha} f(x)\,dx \quad\text{when } x \text{ is continuous} \]
\[ \phantom{E[x^{\alpha}]} = \sum_{-\infty<x<\infty} x^{\alpha} f(x) \quad\text{when } x \text{ is discrete} \]

and when $E[x^{\alpha}]$ exists. Here, $\alpha$ could be any complex number. When $\alpha$ is a positive
integer, $\alpha = r = 1, 2, \ldots$, then the $r$-th moment is denoted by $\mu_r'$, or

\[ \mu_r' = E[x^r], \quad r = 1, 2, \ldots. \tag{4.18} \]

In a mixed case, where some probability mass is distributed on some individually
distinct points and the remaining on a continuum of points, take $x^{\alpha} f(x)$, sum up over
the discrete points and integrate over the continuum of points to get $E[x^{\alpha}]$. Observe
that $E[x^0] = E[1] = 1$.


Notation 4.6. $\mu_r$ = $r$-th central moment for $r = 1, 2, \ldots$


Definition 4.6 (Central moments). The $\alpha$-th central moment is defined as
$E[x - E(x)]^{\alpha}$ for any complex number $\alpha$, whenever it exists, and when $\alpha = r = 1, 2, \ldots$ then
the $r$-th central moment is defined as

\[ \mu_r = E[x - E(x)]^r, \quad r = 1, 2, \ldots. \tag{4.19} \]

The $\alpha$-th moment about any arbitrary point $a$ is defined as $E[x - a]^{\alpha}$ whenever this
expected value exists. Thus when $a = E(x)$ we have the central moments.


Notation 4.7. $\mu_{[r]}$ = $r$-th factorial moment.


Definition 4.7 (Factorial moments). The $r$-th factorial moment is defined as

\[ \mu_{[r]} = E[x(x - 1)(x - 2)\cdots(x - r + 1)] \tag{4.20} \]

whenever it exists.


Factorial moments are usually easier to evaluate compared to moments about the 
origin or central moments when factorials are involved in the denominator in some 
probability functions for discrete variables. This will be seen when we evaluate mo- 
ments in binomial or Poisson probability functions later on. Before proceeding further, 
let us look into some examples. 


Example 4.7. Compute (i) $E(x^3)$; (ii) $E[x - E(x)]^4$; (iii) $E|x|$; (iv) $E(x^2 - 3x + 4)$ for the
following probability function:

\[ f(x) = \begin{cases} 0.5, & x = -2 \\ 0.5, & x = 2 \\ 0, & \text{elsewhere.} \end{cases} \]


Solution 4.7. Since it is a discrete case we sum up. Denoting summation over $x$ by $\sum_x$,
we have the following.

(i) $E(x^3) = \sum_x x^3 f(x) = (-2)^3(0.5) + (2)^3(0.5) + 0 = -4 + 4 = 0$. In this case, it is easily
seen that $E(x^r) = 0$, $r = 1, 3, 5, \ldots$, or all odd moments are zeros here, since the
probability function is symmetric about $x = 0$.

(ii) Since $E(x) = 0$ here, we have

\[ E[x - E(x)]^4 = E[x^4] = \sum_x x^4 f(x) = (-2)^4(0.5) + (2)^4(0.5) + 0 = (16)(0.5) + (16)(0.5) = 16. \]

(iii) Here, we take the absolute values:

\[ E|x| = \sum_x |x| f(x) = |-2|(0.5) + |2|(0.5) + 0 = (2)(0.5) + (2)(0.5) = 2. \]

(iv) Here, we can use the property that the expected value of a sum is the sum of
the expected values:

\[ E(x^2 - 3x + 4) = E(x^2) - 3E(x) + E(4) = E(x^2) - 3(0) + 4 \]

since $E(x) = 0$ and $E(4) = 4$, and this equals $4 + 4 = 8$ since $E(x^2) = (-2)^2(0.5) + (2)^2(0.5) = 4$.
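
The four computations of Solution 4.7 all follow the same pattern, $E[g(x)] = \sum_x g(x) f(x)$, which is easy to sketch in code. The helper name `expect` is an assumption made for this illustration; the probability function is the one given in Example 4.7.

```python
def expect(g, pmf):
    """E[g(x)] = sum over the support of g(x) * f(x) for a discrete random variable."""
    return sum(g(x) * p for x, p in pmf.items())

# Probability function of Example 4.7.
pmf = {-2: 0.5, 2: 0.5}

mean = expect(lambda x: x, pmf)                            # E(x)
third_moment = expect(lambda x: x ** 3, pmf)               # (i)
fourth_central = expect(lambda x: (x - mean) ** 4, pmf)    # (ii)
mean_abs = expect(abs, pmf)                                # (iii)
quadratic = expect(lambda x: x ** 2 - 3 * x + 4, pmf)      # (iv)

print(third_moment, fourth_central, mean_abs, quadratic)   # 0.0 16.0 2.0 8.0
```

Part (iv) is computed here directly from the definition; the solution above reaches the same value by linearity, $E(x^2) - 3E(x) + 4$.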


Example 4.8. Answer the same questions in Example 4.7 for the following density:

\[ f(x) = \begin{cases} x, & 0 \le x \le 1 \\ 2 - x, & 1 \le x \le 2 \\ 0, & \text{elsewhere.} \end{cases} \]


Solution 4.8. Since the variable is continuous here, we will integrate. From $-\infty$ to 0,
the function $f(x)$ is zero and the integral over zero is zero:

\[ E(x^3) = \int_{-\infty}^{\infty} x^3 f(x)\,dx = 0 + \int_0^1 x^3(x)\,dx + \int_1^2 x^3(2 - x)\,dx. \]

Since $f(x)$ has different forms we integrate on each piece, and then

\[ E(x^3) = \Big[\frac{x^5}{5}\Big]_0^1 + \Big[\frac{2x^4}{4} - \frac{x^5}{5}\Big]_1^2 = \frac{1}{5} + \frac{15}{2} - \frac{31}{5} = \frac{3}{2}. \]

Since $x \ge 0$ here, $|x| = x$ itself. Let us compute $E(x)$:

\[ E(x) = 0 + \int_0^1 x(x)\,dx + \int_1^2 x(2 - x)\,dx = \frac{1}{3} + \frac{2}{3} = 1. \]

This may also be seen from the areas in the density function when you graph the density.
The mid-value is 1. Thus, (iii) is answered, the value is 1. Let us compute $E(x^2)$:

\[ E(x^2) = 0 + \int_0^1 x^2(x)\,dx + \int_1^2 x^2(2 - x)\,dx = \frac{7}{6}. \]




Then (iv) can be answered:

\[ E(x^2 - 3x + 4) = E(x^2) - 3E(x) + 4 = \frac{7}{6} - 3(1) + 4 = \frac{13}{6}. \]

Now, (ii) can be answered in two ways, either by expanding $E[x - E(x)]^4$ first by using
the binomial expansion and then taking the expected values, or by substituting for $E(x)$
first and then taking the expected values directly. Let us expand after substituting the
value $E(x) = 1$, by using a binomial expansion. That is,

\[ E[x - E(x)]^4 = E[x - 1]^4 = E[x + (-1)]^4 \]
\[ = E\Big[\binom{4}{0}x^4(-1)^0 + \binom{4}{1}x^3(-1)^1 + \binom{4}{2}x^2(-1)^2 + \binom{4}{3}x^1(-1)^3 + \binom{4}{4}x^0(-1)^4\Big] \]
\[ = E[x^4 - 4x^3 + 6x^2 - 4x + 1] = E(x^4) - 4E(x^3) + 6E(x^2) - 4E(x) + 1 \]
\[ = E(x^4) - 4\Big(\frac{3}{2}\Big) + 6\Big(\frac{7}{6}\Big) - 4 + 1 = E(x^4) - 2. \]

Now, we will compute $E(x^4)$:

\[ E(x^4) = 0 + \int_0^1 x^4(x)\,dx + \int_1^2 x^4(2 - x)\,dx = \frac{31}{15}. \]

Hence

\[ E[x - E(x)]^4 = \frac{31}{15} - 2 = \frac{1}{15}. \]
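
The moments in Solution 4.8 can be cross-checked with exact rational arithmetic. Integrating each polynomial piece term by term gives, for a non-negative integer $r$, $E(x^r) = \frac{1}{r+2} + \frac{2(2^{r+1} - 1)}{r+1} - \frac{2^{r+2} - 1}{r+2}$; the sketch below evaluates this closed form. The function name `raw_moment` is an assumption made for illustration.

```python
from fractions import Fraction

def raw_moment(r):
    """E(x^r) for the density x on [0, 1] and 2 - x on [1, 2], r a non-negative integer."""
    first_piece = Fraction(1, r + 2)                       # integral of x^(r+1) over [0, 1]
    two_x_r = Fraction(2 * (2 ** (r + 1) - 1), r + 1)      # integral of 2*x^r over [1, 2]
    x_r_plus_1 = Fraction(2 ** (r + 2) - 1, r + 2)         # integral of x^(r+1) over [1, 2]
    return first_piece + two_x_r - x_r_plus_1

m1, m2, m3, m4 = (raw_moment(r) for r in range(1, 5))

# (iv): E(x^2 - 3x + 4) by linearity.
part_iv = m2 - 3 * m1 + 4

# (ii): fourth central moment from the binomial expansion about mu = E(x).
mu = m1
central4 = m4 - 4 * mu * m3 + 6 * mu ** 2 * m2 - 4 * mu ** 3 * m1 + mu ** 4

print(m1, m2, m3, m4)      # 1 7/6 3/2 31/15
print(part_iv, central4)   # 13/6 1/15
```

The value `raw_moment(0)` equals 1, which is the check that the function integrates to 1 as a density must.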


Example 4.9. Prove that the following function is a probability function for a discrete
random variable and then compute the second factorial moment:

\[ f(x) = \begin{cases} \dfrac{\lambda^x}{x!} e^{-\lambda}, & x = 0, 1, 2, \ldots,\ \lambda > 0 \\ 0, & \text{elsewhere.} \end{cases} \]

[This probability function is known as the Poisson probability function, named after its
inventor, S. Poisson, a French mathematician.]


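Solution 4.9. Since the exponential function is non-negative, $f(x)$ here is non-negative
for all values of $x$ and $\lambda > 0$. Here, $\lambda > 0$ is a parameter. The total probability is

\[ \sum_x f(x) = e^{-\lambda} \sum_{x=0}^{\infty} \frac{\lambda^x}{x!}. \]

But

\[ \sum_{x=0}^{\infty} \frac{\lambda^x}{x!} = 1 + \frac{\lambda}{1!} + \frac{\lambda^2}{2!} + \cdots = e^{\lambda} \]

by the exponential series. Hence the total probability is

\[ e^{-\lambda} e^{\lambda} = e^{0} = 1. \]

Hence $f(x)$ here is a probability function. The second factorial moment is

\[ \mu_{[2]} = E[x(x - 1)] = \sum_{x=0}^{\infty} x(x - 1)\frac{\lambda^x}{x!} e^{-\lambda} = e^{-\lambda} \sum_{x=0}^{\infty} x(x - 1)\frac{\lambda^x}{x!}. \]

But note that when $x = 0$ or $x = 1$ the summand is zero, and hence the sum starts only
from $x = 2$, going to infinity. When $x$ goes from 2 to $\infty$, that is, when $x \ne 0$ and $x \ne 1$, we can
divide both numerator and denominator by $x(x - 1)$, or we can cancel $x(x - 1)$. That is,

\[ \frac{x(x - 1)}{x!} = \frac{1}{(x - 2)!}. \]

But this cancellation, or division of numerator and denominator by the same quantity,
would not be possible if $x$ could be zero or 1 because division by zero is impossible. Now,
we can sum up. For convenience, let us take $\lambda^2$ and $e^{-\lambda}$ outside. Hence

\[ E[x(x - 1)] = \lambda^2 e^{-\lambda} \sum_{x=2}^{\infty} \frac{\lambda^{x-2}}{(x - 2)!} = \lambda^2 e^{-\lambda}\Big[1 + \frac{\lambda}{1!} + \frac{\lambda^2}{2!} + \cdots\Big] = \lambda^2 e^{-\lambda} e^{\lambda} = \lambda^2. \]

Here, the sum was opened up by putting $x = 2$, $x = 3$, $x = 4$ and so on and writing up
the terms. We could have also used the procedure of substitution. Put $y = x - 2$; then,
when $x = 2$, we have $y = 0$ and then

\[ \sum_{x=2}^{\infty} \frac{\lambda^{x-2}}{(x - 2)!} = \sum_{y=0}^{\infty} \frac{\lambda^{y}}{y!} = e^{\lambda} \]

which would have also yielded the same result.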


In this example, we may note one interesting aspect. Since $x!$ was sitting in the
denominator, we could easily cancel factors such as $x$, $x(x - 1)$, $x(x - 1)(x - 2)$, etc.,
and thus factorial moments are easy to evaluate. But we cannot cancel $x^2$ because
after canceling one $x$ there is still one more $x$ left and the denominator has become
$(x - 1)!$. If we want to compute $E[x^2]$ in this case, we can use the identity

\[ x(x - 1) = x^2 - x \Rightarrow x^2 = x(x - 1) + x. \]


Higher moments about the origin can be computed from factorial moments by using 
such identities in this Poisson case because of the presence of x! in the denominator 
of the probability function. 
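
A numerical sketch of Example 4.9: truncating the infinite sums at a finite number of terms should reproduce total probability 1, the second factorial moment $\lambda^2$, and, through the identity $x^2 = x(x - 1) + x$ together with $E(x) = \lambda$ for the Poisson case, $E[x^2] = \lambda^2 + \lambda$. The helper names, the parameter value $\lambda = 2.5$ and the truncation at 100 terms are assumptions chosen only for illustration.

```python
import math

def poisson_pmf(x, lam):
    """Poisson probability function: lam^x * e^(-lam) / x! for x = 0, 1, 2, ..."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

def poisson_expect(g, lam, terms=100):
    """Approximate E[g(x)] by summing the first `terms` terms of the series."""
    return sum(g(x) * poisson_pmf(x, lam) for x in range(terms))

lam = 2.5  # hypothetical parameter value

total = poisson_expect(lambda x: 1, lam)                  # total probability
factorial2 = poisson_expect(lambda x: x * (x - 1), lam)   # E[x(x-1)]
mean = poisson_expect(lambda x: x, lam)                   # E[x]
second = poisson_expect(lambda x: x * x, lam)             # E[x^2]

print(total, factorial2, lam ** 2)
print(second, factorial2 + mean)   # identity: E[x^2] = E[x(x-1)] + E[x]
```

For moderate $\lambda$ the neglected tail of the series is far below floating-point precision, so the truncated sums agree with the closed forms to many digits.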




4.3.1 Moment generating function 


There are several types of generating functions which will all generate integer mo- 
ments about the origin. 


Notation 4.8. $M(t)$: Moment generating function of a random variable $x$.


Definition 4.8 (Moment generating function). The moment generating function
of a random variable $x$, or of the probability/density function $f(x)$, is defined as the
following sum/integral:

\[ M(t) = \int_{-\infty}^{\infty} e^{tx} f(x)\,dx \quad\text{when } x \text{ is continuous} \]
\[ \phantom{M(t)} = \sum_{-\infty<x<\infty} e^{tx} f(x) \quad\text{when } x \text{ is discrete} \]

whenever the sum/integral exists. In the mixed case, integrate over the continuum
of points and sum up over the discrete points where there are non-zero probability
masses.


Why is $M(t)$ called the moment generating function? Let us expand $e^{tx}$. That is,
for example, for the continuous situation,

\[ M(t) = \int_{-\infty}^{\infty} e^{tx} f(x)\,dx = \int_{-\infty}^{\infty} \Big[1 + \frac{tx}{1!} + \frac{t^2x^2}{2!} + \cdots\Big] f(x)\,dx \]
\[ = \int_{-\infty}^{\infty} f(x)\,dx + \frac{t}{1!}\int_{-\infty}^{\infty} x f(x)\,dx + \cdots \]

if term by term integration is possible, and

\[ M(t) = 1 + \frac{t}{1!}E(x) + \frac{t^2}{2!}E(x^2) + \frac{t^3}{3!}E(x^3) + \cdots \tag{4.21} \]

Thus the coefficient of $\frac{t^r}{r!}$ in the above power series is $E(x^r)$, the $r$-th integer moment
of $x$, for $r = 0, 1, 2, \ldots$. Thus all the integer moments about the origin are generated by
this function $M(t)$, and hence it is called a moment generating function. Thus, if $M(t)$
exists and admits a power series in $t$ then the coefficient of $\frac{t^r}{r!}$ is the $r$-th integer moment
$E(x^r)$. If $M(t)$ is differentiable, then you may differentiate successively and evaluate at
$t = 0$. From (4.21), note that

\[ \mu_r' = E[x^r] = \frac{d^r}{dt^r} M(t)\Big|_{t=0} \tag{4.22} \]
\[ \phantom{\mu_r'} = \text{coefficient of } \frac{t^r}{r!} \text{ in the series.} \tag{4.23} \]

Differentiation is possible when $M(t)$ is differentiable and series expansion is possible
when $M(t)$ can be expanded into a power series.
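
The relation between derivatives of $M(t)$ at $t = 0$ and the moments can be illustrated numerically. The sketch below builds $M(t)$ for the probability function of Example 4.7 and approximates $M'(0)$ and $M''(0)$ by central finite differences; finite differences stand in here for the exact differentiation described above, and the helper names and the step size $h = 10^{-3}$ are assumptions made for illustration.

```python
import math

def mgf(t, pmf):
    """M(t) = sum of e^(t*x) * f(x) over the support of a discrete random variable."""
    return sum(math.exp(t * x) * p for x, p in pmf.items())

def first_two_moments(pmf, h=1e-3):
    """Approximate M'(0) and M''(0) by central differences with step h."""
    m_plus, m_zero, m_minus = mgf(h, pmf), mgf(0.0, pmf), mgf(-h, pmf)
    first = (m_plus - m_minus) / (2 * h)
    second = (m_plus - 2 * m_zero + m_minus) / (h * h)
    return first, second

pmf = {-2: 0.5, 2: 0.5}   # probability function of Example 4.7

approx_mean, approx_second = first_two_moments(pmf)
exact_mean = sum(x * p for x, p in pmf.items())          # E(x) = 0
exact_second = sum(x * x * p for x, p in pmf.items())    # E(x^2) = 4

print(approx_mean, exact_mean)
print(approx_second, exact_second)
```

The approximations agree with $E(x)$ and $E(x^2)$ up to the truncation error of the difference formulas, which shrinks as $h$ is reduced (until rounding error takes over).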

This M(t) may not exist for many of the probability/density functions. Let us check 
one example. 


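Example 4.10. Check whether the moment generating function $M(t)$ exists for the
following density:

\[ f(x) = \begin{cases} \dfrac{1}{x^2}, & 1 \le x < \infty \\ 0, & \text{elsewhere.} \end{cases} \]

Solution 4.10. It was already shown earlier that this $f(x)$ is a density function. Here,

\[ M(t) = \int_{-\infty}^{\infty} e^{tx} f(x)\,dx = 0 + \int_1^{\infty} \frac{1}{x^2} e^{tx}\,dx. \]

Integrating by parts, taking $u = e^{tx}$ and $dv = \frac{1}{x^2}dx$ so that $v = \int \frac{1}{x^2}dx = -\frac{1}{x}$, and then using
the formula $\int u\,dv = uv - \int v\,du$, we have

\[ M(t) = \Big[-\frac{1}{x} e^{tx}\Big]_1^{\infty} + t\int_1^{\infty} \frac{e^{tx}}{x}\,dx \]
\[ = \Big[-\frac{1}{x} e^{tx}\Big]_1^{\infty} + t\big[\ln x\, e^{tx}\big]_1^{\infty} - t^2\int_1^{\infty} \ln x\, e^{tx}\,dx. \]

The first term goes to $-\infty$ for $t > 0$ since the exponential term increases faster than $x$.
The integral does not exist for $t > 0$. Equivalently, for $t > 0$ the integrand $e^{tx}/x^2$ itself
grows without bound as $x \to \infty$, so the integral diverges. There is another generating
function which always exists when $f(x)$ is a density. This is known as the characteristic
function of the random variable $x$.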


Notation 4.9. $\phi(t)$ = the characteristic function of $x$ or of $f(x)$.


Definition 4.9 (Characteristic function). The characteristic function $\phi(t)$ is
defined as

\[ \phi(t) = \int_{-\infty}^{\infty} e^{itx} f(x)\,dx, \quad i = \sqrt{-1}, \quad\text{when } x \text{ is continuous} \]
\[ \phantom{\phi(t)} = \sum_{-\infty<x<\infty} e^{itx} f(x) \quad\text{when } x \text{ is discrete.} \tag{4.24} \]

Mixed cases can be handled as stated before. Since

\[ |e^{itx}| = |\cos tx + i\sin tx| = \sqrt{\cos^2 tx + \sin^2 tx} = 1 \tag{4.25} \]

we have

\[ |\phi(t)| = \Big|\int_{-\infty}^{\infty} e^{itx} f(x)\,dx\Big| \le \int_{-\infty}^{\infty} |e^{itx}| f(x)\,dx = \int_{-\infty}^{\infty} f(x)\,dx = 1 \tag{4.26} \]

by definition. Hence the integral is always convergent.
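
The bound $|\phi(t)| \le 1$ in (4.26) can be observed numerically for a discrete random variable. The sketch below evaluates the characteristic function of the probability function $f_1(x)$ of Example 4.5 on a grid of $t$ values; the function name `char_fn`, the grid, and the small tolerance used to absorb floating-point rounding are assumptions made for illustration.

```python
import cmath

def char_fn(t, pmf):
    """phi(t) = sum of e^(i*t*x) * f(x) over the support of a discrete random variable."""
    return sum(cmath.exp(1j * t * x) * p for x, p in pmf.items())

# Probability function f1 of Example 4.5.
pmf = {1: 0.4, 2: 0.2, 7: 0.4}

# Evaluate |phi(t)| for t = -10.0, -9.9, ..., 10.0.
grid = [k / 10 for k in range(-100, 101)]
moduli = [abs(char_fn(t, pmf)) for t in grid]

print(abs(char_fn(0.0, pmf)))   # modulus at t = 0 is (up to rounding) 1
print(max(moduli))              # never exceeds 1 (up to rounding)
```

Unlike the moment generating function, no growth in $x$ can make this sum or integral diverge, because every term has modulus at most $f(x)$.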


Mathematical concepts corresponding to moments, moment generating function, 
etc. are given in a note below. Those who are not familiar with complex variables may 
skip the notes and go directly to Chapter 5. 


4.3.2 Moments and Mellin transforms 


We have defined arbitrary moments. One such moment is called the Mellin transform 
of a function. Consider a function $f(x)$ defined over $0 \le x < \infty$ and consider an integral
of the following type.


Notation 4.10. $M_f(s)$: Mellin transform of the function $f(x)$, with the Mellin
parameter $s$.


Definition 4.10 (Mellin transform). The Mellin transform of a function $f(x)$ is given
by

\[ M_f(s) = \int_0^{\infty} x^{s-1} f(x)\,dx \tag{4.27} \]

provided the integral exists, where $s$ is a complex variable. Note that when $f(x)$ is a
density function then it is nothing but the $(s - 1)$-th moment of $x$. The Mellin transform
is defined for $f(x)$ where $f(x)$ need not be a density. In Example 4.8, we could have
computed arbitrary moments of the type $E[x^{s-1}]$ where $s$ is a complex variable. Thus
in this case the Mellin transform of the density function in Example 4.8 exists.


One question that is often asked is that suppose that you know a function of s, 
which is the Mellin transform of some unknown function f(x), can we determine f(x)? 
Then f(x) will be called the inverse Mellin transform. For statisticians, this problem is 
of great interest. We may be able to come up with a Mellin transform through some pro- 
cedure. Then the problem is to determine the unknown density which produced that 
Mellin transform. The formula to recover the unknown function from a given Mellin
transform is the following:

\[ f(x) = \frac{1}{2\pi i} \int_{c - i\infty}^{c + i\infty} M_f(s)\, x^{-s}\,ds \tag{4.28} \]

where $i = \sqrt{-1}$. This formula holds under some conditions on the unknown function
and the known $M_f(s)$. This integral is a contour integral or an integral in the complex
plane. Since this area is beyond the scope of this book, we will not elaborate on this
aspect here.


Notation 4.11. $F_f(t)$: Fourier transform of $f(x)$ with parameter $t$.


Definition 4.11 (Fourier transform). The Fourier transform of a function $f(x)$, with
parameter $t$, is defined as

\[ F_f(t) = \int_{-\infty}^{\infty} e^{-itx} f(x)\,dx, \quad i = \sqrt{-1}. \tag{4.29} \]

This integral is in the complex domain, and hence we will not elaborate here. Note
that the characteristic function $\phi(t)$, and the Fourier transform $F_f(t)$ when $f(x)$ is a
density function, will also generate the moments. The Fourier transform is defined for
arbitrary functions, which need not be density functions.


Another generating function, which is applicable to positive random variables, is
the Laplace transform. This will also generate moments when $f(x)$ is a density function,
but the Laplace transform is defined for any $f(x)$ as long as the integral exists.


Notation 4.12. $L_f(t)$: Laplace transform of the function $f(x)$, with the Laplace
parameter $t$.


Definition 4.12 (Laplace transform). The Laplace transform of a function $f(x)$ is
defined as the following integral:

\[ L_f(t) = \int_0^{\infty} e^{-tx} f(x)\,dx \tag{4.30} \]

whenever this integral is convergent, where $t$ is called the Laplace parameter or the
parameter in the Laplace transform.


Such an integral need not exist for all functions $f(x)$. Let us expand $e^{-tx}$. If $f(x)$ is
a density function for a positive random variable $x$ and if $L_f(t)$ admits a power series
expansion, or if all integer moments exist, then as in the case of the moment generating
function, one can obtain the integer moments:

\[ E(x^r) = (-1)^r \frac{d^r}{dt^r} L_f(t)\Big|_{t=0} \quad\text{when } L_f(t) \text{ is differentiable} \tag{4.31} \]
\[ \phantom{E(x^r)} = \text{coefficient of } \frac{(-t)^r}{r!} \quad\text{when } L_f(t) \text{ is expansible.} \tag{4.32} \]




Let $M_x(t)$ be the moment generating function of a real random variable $x$; then the
following properties follow from the definition itself:

\[ (1):\ \lim_{t \to 0} M_x(t) = 1; \quad (2):\ M_{ax}(t) = M_x(at); \quad (3):\ M_{ax+b}(t) = e^{bt} M_x(at) \tag{4.33} \]


where a and b are constants. Similar properties hold for characteristic function and 
Laplace transform (for positive variables). 
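
Properties (2) and (3) of (4.33) can be checked directly for a discrete random variable by computing the moment generating function of $ax + b$ from its own probability function and comparing it with $e^{bt} M_x(at)$. In the sketch below the probability function is the one from Exercise 4.3.1, while the helper names and the constants $a = 3$, $b = -1.5$, $t = 0.2$ are assumptions chosen for illustration.

```python
import math

def mgf(t, pmf):
    """M(t) = sum of e^(t*x) * f(x) over the support of a discrete random variable."""
    return sum(math.exp(t * x) * p for x, p in pmf.items())

def affine(pmf, a, b):
    """Probability function of a*x + b (a must be non-zero so values stay distinct)."""
    return {a * x + b: p for x, p in pmf.items()}

pmf = {-1: 0.7, 2: 0.3}        # probability function of Exercise 4.3.1
a, b, t = 3.0, -1.5, 0.2       # hypothetical constants

# Property (3): M_{ax+b}(t) = e^{bt} * M_x(at)
lhs = mgf(t, affine(pmf, a, b))
rhs = math.exp(b * t) * mgf(a * t, pmf)

# Property (2) is the special case b = 0.
lhs_scale = mgf(t, affine(pmf, a, 0.0))
rhs_scale = mgf(a * t, pmf)

print(lhs, rhs)
print(lhs_scale, rhs_scale)
print(mgf(1e-8, pmf))          # property (1): close to 1 near t = 0
```

The two sides agree up to floating-point rounding, since $e^{t(ax+b)} = e^{bt} e^{(at)x}$ term by term.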


4.3.3 Uniqueness and the moment problem 


For a given random variable x, whether it is continuous, discrete or mixed, there is 
only one distribution function (cumulative probability or density), one moment gen- 
erating function, one characteristic function, corresponding to this variable. There 
cannot be two different density functions or two different moment generating func- 
tions, etc. corresponding to a given random variable. What about moments? Suppose 
that all integer moments are available, that is, suppose that E(x"), r = 0,1,2, ... are all 
fixed or given. Will the random variable x be uniquely determined by this moment se- 
quence? This is known as the moment problem. This problem originated in physics. 
The answer is that these moments, even though a countably infinite number of them are
available, need not uniquely determine the random variable; there can be more
than one random variable giving rise to the same moment sequence. Some sets of
sufficient conditions are available in the literature under which a moment sequence will
uniquely determine the random variable. One such condition is that the variable has 
a finite range with non-zero density/probability. The non-zero part of the density is in 
the finite range $a \le x \le b$ where $a$ and $b$ are finite. One such example is the uniform
density. 

What about the Mellin transform, which will be the $(s - 1)$-th moment when $f(x)$ is a
density for $x \ge 0$: will it determine the density? In the Mellin transform $M_f(s) = E(x^{s-1})$, $s$ is
a complex variable. As long as $s$ is defined on a strip in the complex plane where the
function M,(s) is analytic, then we can show that f(x) is uniquely determined under 
some minor additional conditions and inverse Mellin transform, given in (4.28), is the 
unique determination of f(x). Detailed conditions for the existence of Mellin and in- 
verse Mellin transform may be seen from the book [2], which is available at CMS. Sim- 
ilarly, from the given moment generating function, characteristic function, Laplace 
transform, Fourier transform, the function f (x) can be uniquely determined under spe- 
cific conditions. These formulae are known as the inverse Laplace transform, inverse 
Fourier transform, etc. What we will do later is not the evaluation of these inverse 
transforms by using complex analysis but remembering that these inverse transforms 
will uniquely determine the original function f(x) we will identify the transforms with 
known transforms and write up f(x). 




Exercises 4.3 


For the following probability/density functions evaluate the moment generating func- 
tion M(t) and then obtain the r-th integer moment by (i) differentiation when M(t) 
is differentiable, (ii) by series expansion when M(t) can be expanded as a power se- 
ries: 

4.3.1.
\[ f(x) = \begin{cases} 0.7, & x = -1 \\ 0.3, & x = 2 \\ 0, & \text{elsewhere.} \end{cases} \]


4.3.2.
\[ f(x) = \begin{cases} p\,q^{x-1}, & x = 1, 2, \ldots,\ 0 < p < 1,\ q = 1 - p \\ 0, & \text{elsewhere.} \end{cases} \]


4.3.3.
\[ f(x) = \begin{cases} \dfrac{1}{3}, & 0 < x < 3 \\ 0, & \text{elsewhere.} \end{cases} \]


4.3.4.
\[ f(x) = \begin{cases} \theta e^{-\theta x}, & x > 0,\ \theta > 0 \\ 0, & \text{elsewhere.} \end{cases} \]


4.3.5. Compute $E(x^{\alpha})$ for $\alpha$ a complex number for Exercises 4.3.1 and 4.3.3.


4.3.6. Prove that E|x — a| is a minimum when a is the median of the random vari- 
able x. 


5 Commonly used discrete distributions 


5.1 Introduction 


In this chapter, we will deal with some discrete distributions and in the next chapter
we will consider continuous distributions. The most commonly appearing discrete
distributions are associated with Bernoulli trials. If each outcome of a random experiment
consists of only two possibilities, then the experiment is called a Bernoulli trial. For
example, when a coin is tossed, only head H or tail T can appear in each trial. If a student
is writing an examination and the final result is to be recorded as either a pass P or a
failure F, then only one of the two possibilities can occur, and each attempt at the
examination is a Bernoulli trial. But if the final result is to be recorded as one of the
grades A, B, C, D, then there are four possibilities in each outcome and this is not a
Bernoulli trial. When a die is rolled once and we are looking for either an odd number O or
an even number E, there are only two possibilities and it is a Bernoulli trial. But if our
aim is to see which number turns up, then there are 6 possibilities, that is, one of the
numbers 1, 2, 3, 4, 5, 6 can appear. It is then a multinomial trial, not a Bernoulli trial.

In a Bernoulli trial, let the possible events in each outcome be denoted by $A$ and $B$.
Then $A \cap B = \phi$ and $A \cup B = S$ = the sure event. Let the occurrence of $A$ be
called "a success" and the occurrence of $B$ "a failure". Let the probability of $A$ be $p$.
Then

$$P(A) = p, \quad P(B) = P(A^c) = 1 - P(A) = 1 - p = q$$

where $1 - p$ is denoted by $q$. If a balanced or unbiased coin is tossed once and if getting
a head is a success, then $P(A) = \frac{1}{2}$, and if the coin is not unbiased then
$P(A) \ne \frac{1}{2}$. When a balanced die is rolled once and if $A$ is the event of getting
the numbers 1 or 2, then $B$ is the event of getting 3 or 4 or 5 or 6. In this case,

$$P(A) = \frac{2}{6} = \frac{1}{3} \quad \text{and} \quad P(B) = \frac{4}{6} = \frac{2}{3}.$$


5.2 Bernoulli probability law 


Let x be the number of successes in one Bernoulli trial. Then x = 1 means a success 
with probability p and x = 0 means a failure with probability q. These are the only two 
values $x$ can take with non-zero probabilities here. Then the probability function in
this case, denoted by $f_1(x)$, can be written as

$$f_1(x) = \begin{cases} p^x q^{1-x}, & x = 0, 1 \\ 0, & \text{elsewhere.} \end{cases}$$


Open Access. © 2018 Arak M. Mathai, Hans J. Haubold, published by De Gruyter. This work is licensed
under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
https://doi.org/10.1515/9783110562545-005




Here, $p$ is the only parameter. This is known as the Bernoulli probability law. The mean
value $E(x)$, the variance $\sigma^2 = E[x - E(x)]^2$ and the moment generating function
$M(t)$ are the following. Since it is a discrete case, we sum up:

$$E(x) = \sum_x x f_1(x) = 0 + (0)[p^0 q^{1-0}] + (1)[p^1 q^{1-1}] = p. \tag{5.1}$$

$$E(x^2) = \sum_x x^2 f_1(x) = 0 + (0)^2[p^0 q^{1-0}] + (1)^2[p^1 q^{1-1}] = p.$$

$$\mathrm{Var}(x) = E(x^2) - [E(x)]^2 = p - p^2 = p(1 - p) = pq. \tag{5.2}$$

$$M(t) = \sum_x e^{tx} f_1(x) = 0 + e^{t(0)}[p^0 q^{1-0}] + e^{t(1)}[p^1 q^{1-1}] = q + pe^t. \tag{5.3}$$
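We may note that this $M(t)$ can be expanded in a power series and it can also be
differentiated. We can obtain the integer moments by expansion or by differentiation:

$$M(t) = q + p\left[1 + \frac{t}{1!} + \frac{t^2}{2!} + \cdots\right] = 1 + p\frac{t}{1!} + p\frac{t^2}{2!} + \cdots$$

Therefore, the coefficient of $\frac{t}{1!}$ is $p = E(x)$ and the coefficient of
$\frac{t^2}{2!}$ is $p = E(x^2)$. Higher integer moments can also be obtained from this
series. Now, consider differentiation:

$$E(x) = \frac{d}{dt}M(t)\Big|_{t=0} = \frac{d}{dt}(q + pe^t)\Big|_{t=0} = [pe^t]_{t=0} = p,$$

$$E(x^2) = \frac{d^2}{dt^2}M(t)\Big|_{t=0} = [pe^t]_{t=0} = p.$$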
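As a quick numerical check of (5.1) to (5.3), the following minimal Python sketch sums over the two support points of the Bernoulli law. The helper names `bernoulli_pmf` and `bernoulli_moment` and the value $p = 3/10$ are illustrative assumptions, not notation from the text.

```python
from fractions import Fraction


def bernoulli_pmf(x, p):
    """f_1(x) = p^x q^(1-x) for x = 0, 1 and zero elsewhere (q = 1 - p)."""
    if x not in (0, 1):
        return Fraction(0)
    return p**x * (1 - p) ** (1 - x)


def bernoulli_moment(r, p):
    """r-th integer moment E(x^r), r >= 1, obtained by summing over the support."""
    return sum(x**r * bernoulli_pmf(x, p) for x in (0, 1))


p = Fraction(3, 10)  # illustrative value of the parameter (an assumption)
mean = bernoulli_moment(1, p)                # should equal p
variance = bernoulli_moment(2, p) - mean**2  # should equal p*q
print(mean, variance)
```

Exact fractions are used so that the identities $E(x) = p$ and $\mathrm{Var}(x) = pq$ can be compared without rounding error.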


Example 5.1. A gambler gets Rs 5 if the number 1 or 3 or 6 comes when a balanced
die is rolled once and he loses Rs 5 if the number 2 or 4 or 5 comes. How much money
can he expect to win in one trial of rolling this die once?


Solution 5.1. This is nothing but the expected value of a Bernoulli random variable
with $p = \frac{1}{2}$ since the die is balanced. Hence

$$E(x) = 0 + (5)\left(\frac{1}{2}\right) + (-5)\left(\frac{1}{2}\right) = 0.$$

It is a fair game. Neither the gambler nor the gambling house has the upper hand; the
expected gain or win is zero. Suppose that the die is loaded towards the numbers 2, 4
and 5, and suppose that the probability that one of these numbers occurs is $\frac{2}{3}$.
Then the expected gain or win of the gambler is

$$E(x) = 0 + (5)\left(\frac{1}{3}\right) + (-5)\left(\frac{2}{3}\right) = -\frac{5}{3}$$

or the gambler is expected to lose Rs $\frac{5}{3}$ in each game, that is, the gambling
house has the upper hand.




5.3 Binomial probability law 


Suppose that a Bernoulli trial is repeated $n$ times under identical situations or consider
$n$ identical independent Bernoulli trials. Let $x$ be the total number of successes in these
$n$ trials. Then $x$ can take the values $0, 1, 2, \ldots, n$ with non-zero probabilities. In
each trial, the probability of success is $p$. A success or failure in a trial does not depend
upon what happened before. If $A$ is the event of getting a success in the second trial
and if $B$ is the event of getting a failure in the first trial then

$$P(B) = q, \quad P(A|B) = P(A) = p, \quad P(A \cap B) = qp,$$

where $P(A \cap B)$ is the probability of getting the sequence "failure, success". Suppose
that the first $x$ trials resulted in successes and the remaining $n - x$ trials resulted in
failures. Then the probability of getting the sequence $SS \cdots SFF \cdots F$, where $S$
denotes a success and $F$ denotes a failure, is

$$pp \cdots p\,qq \cdots q = p^x q^{n-x}.$$

Suppose that the first three trials were failures, the next $x$ trials were successes and the
remaining trials were failures; then the probability for this sequence is
$qqq\,p^x\,q \cdots q = p^x q^{n-x}$. For any given sequence, whichever way $S$ and $F$ appear,
the probability is $p^x q^{n-x}$. How many such sequences are possible? It is $\binom{n}{x}$ or
$\binom{n}{n-x}$. Hence if the probability of getting exactly $x$ successes in $n$ independent
Bernoulli trials is denoted by $f_2(x)$ then

$$f_2(x) = \begin{cases} \binom{n}{x} p^x q^{n-x}, & x = 0, 1, \ldots, n;\ 0 < p < 1,\ q = 1 - p,\ n = 1, 2, \ldots \\ 0, & \text{elsewhere.} \end{cases}$$


Note that $n$ and $p$ are parameters here. What is the total probability in this case?

$$\sum_x f_2(x) = 0 + \sum_{x=0}^{n} \binom{n}{x} p^x q^{n-x} = (q + p)^n = 1^n = 1,$$

see equation (3.12) of Section 3.3 for the binomial sum. The total probability is 1 as can
be expected when it is a probability law. Since $f_2(x)$ is the general term in a binomial
expansion of $(q + p)^n$, this $f_2(x)$ is called a binomial probability law. What are the mean
value, variance and the moment generating function in this case?


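$$E(x) = \sum_x x f_2(x) = \sum_{x=0}^{n} x \binom{n}{x} p^x q^{n-x} = \sum_{x=1}^{n} x \binom{n}{x} p^x q^{n-x}$$

since at $x = 0$, $x f_2(x) = 0$. For $x \ne 0$, we can cancel $x$ or divide numerator and
denominator by $x$. We can rewrite

$$x \binom{n}{x} = x\,\frac{n!}{x!(n-x)!} = \frac{n!}{(x-1)!(n-x)!}$$

since for $x \ne 0$ we can cancel $x$,

$$\frac{n!}{(x-1)!(n-x)!} = n\left[\frac{(n-1)!}{(x-1)!(n-x)!}\right] = n \binom{n-1}{x-1}$$

and

$$p^x q^{n-x} = p\left[p^{x-1} q^{(n-1)-(x-1)}\right].$$

Therefore,

$$E(x) = \sum_x x f_2(x) = np \sum_{x=1}^{n} \binom{n-1}{x-1} p^{x-1} q^{(n-1)-(x-1)}$$
$$= np \sum_{y=0}^{N} \binom{N}{y} p^y q^{N-y}, \quad y = x - 1,\ N = n - 1$$
$$= np(q + p)^{n-1} = np \quad \text{since } q + p = 1.$$

Therefore, the mean value here is

$$E(x) = np. \tag{5.4}$$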
For computing the variance, we can use the formula

$$\sigma^2 = E[x - E(x)]^2 = E(x^2) - [E(x)]^2.$$

Let us compute $E(x^2)$ first:

$$E(x^2) = \sum_{x=0}^{n} x^2 f_2(x) = \sum_{x=1}^{n} x^2\,\frac{n!}{x!(n-x)!}\,p^x q^{n-x}$$

since at $x = 0$ the right side is zero, and hence the sum starts from $x = 1$. We can cancel
one $x$ with $x!$ giving $(x - 1)!$ in the denominator. But still there is one more $x$ in the
numerator. Since a factorial is sitting in the denominator, it is easier to compute the
factorial moments. Hence we may use the identity

$$x^2 = x(x - 1) + x.$$

Now, we can compute $E[x(x - 1)]$, and $E(x)$ is already computed:

$$E[x(x-1)] = \sum_{x=0}^{n} x(x-1)\,\frac{n!}{x!(n-x)!}\,p^x q^{n-x} = \sum_{x=2}^{n} x(x-1)\,\frac{n!}{x!(n-x)!}\,p^x q^{n-x}$$

since at $x = 0$, $x = 1$ the right side is zero. That is,

$$E[x(x-1)] = \sum_{x=2}^{n} \frac{n!}{(x-2)!(n-x)!}\,p^x q^{n-x}.$$

Now, take out $n(n - 1)$ from $n!$, take out $p^2$ from $p^x$, rewrite $n - x = (n - 2) - (x - 2)$,
substitute $y = x - 2$, $N = n - 2$. Then we have

$$E[x(x-1)] = n(n-1)p^2 \sum_{y=0}^{N} \binom{N}{y} p^y q^{N-y} = n(n-1)p^2 (q + p)^{n-2} = n(n-1)p^2$$

since $(q + p) = 1$. Therefore,

$$\sigma^2 = E[x^2] - [E(x)]^2 = n(n-1)p^2 + np - (np)^2 = np - np^2 = np(1 - p) = npq. \tag{5.5}$$


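Thus, in the binomial case the mean value is $E(x) = np$ and the variance is $\sigma^2 = npq$.
Let us compute the moment generating function:

$$M(t) = \sum_x e^{tx} f_2(x) = \sum_{x=0}^{n} \binom{n}{x} e^{tx} p^x q^{n-x} = \sum_{x=0}^{n} \binom{n}{x} (pe^t)^x q^{n-x} = (q + pe^t)^n. \tag{5.6}$$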


For a binomial expansion, see Section 3.3. Note that, the integer moments can be eas- 
ily obtained by differentiation of this moment generating function. Expansion can be 
done but it is more involved. Deriving the integer moments by differentiation is left to 
the students. 
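The binomial results can be checked numerically. The sketch below, with illustrative parameter values $n = 10$, $p = 1/4$ (assumptions, not values from the text), verifies the total probability, the mean $np$, the factorial moment $n(n-1)p^2$ and the variance $npq$ exactly, and approximates $M'(0)$ from (5.6) by a central difference. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb, exp


def binomial_pmf(x, n, p):
    """f_2(x) = C(n, x) p^x q^(n-x) for x = 0, 1, ..., n and zero elsewhere."""
    if not 0 <= x <= n:
        return 0
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def binomial_mgf(t, n, p):
    """Closed form (5.6): M(t) = (q + p e^t)^n."""
    return (1 - p + p * exp(t)) ** n


n, p = 10, Fraction(1, 4)  # illustrative parameter values (assumptions)
support = range(n + 1)

total = sum(binomial_pmf(x, n, p) for x in support)
mean = sum(x * binomial_pmf(x, n, p) for x in support)
factorial_moment = sum(x * (x - 1) * binomial_pmf(x, n, p) for x in support)
variance = factorial_moment + mean - mean**2

# Central difference approximation of M'(0), which should be close to n*p.
h = 1e-5
mean_from_mgf = (binomial_mgf(h, n, float(p)) - binomial_mgf(-h, n, float(p))) / (2 * h)

print(total, mean, variance, mean_from_mgf)
```

The numerical derivative only stands in for the analytic differentiation of $M(t)$; it does not replace that derivation.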

Before doing some examples, we will introduce two more standard probability 
functions associated with Bernoulli trials. 


5.4 Geometric probability law 


Again, let us consider independent Bernoulli trials where the probability of success 
in every trial is p and q = 1 - p. Let us ask the question: what is the probability that 
the first success is at the $x$-th trial? The trial number is the random variable here. Let
$f_3(x)$ denote this probability function. If the first success is at the $x$-th trial, then
the first $x - 1$ trials resulted in failures with probability $q$ each, followed by a success
with probability $p$. Therefore,

$$f_3(x) = \begin{cases} q^{x-1} p, & x = 1, 2, \ldots \\ 0, & \text{elsewhere.} \end{cases}$$

Note that $p$ is the only parameter here. The successive terms here are $p, pq, pq^2, \ldots$
which are in geometric progression, and hence the law is called the geometric probability
law. The graph is shown in Figure 5.1.


Figure 5.1: Geometric probability law.


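Let us see the sum of the probabilities here. The total probability

$$\sum_x f_3(x) = \sum_{x=1}^{\infty} q^{x-1} p = p\{1 + q + q^2 + \cdots\}$$
$$= p(1 - q)^{-1} \quad \text{(see the binomial expansion in Section 3.3)}$$
$$= p\,p^{-1} = 1$$

as can be expected. Let us compute the mean value $E(x)$, variance $\sigma^2$ and the moment
generating function $M(t)$ for this geometric probability law:

$$E(x) = \sum_x x f_3(x) = \sum_{x=1}^{\infty} x q^{x-1} p = p\{1 + 2q + 3q^2 + \cdots\}$$
$$= p(1 - q)^{-2}, \quad 0 < q < 1 \quad \text{(see Section 3.3 for the binomial sum)}$$
$$= p\,p^{-2} = \frac{1}{p}, \quad 0 < p < 1. \tag{5.7}$$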


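We can also derive this by using the following procedure:

$$E(x) = \sum_{x=1}^{\infty} x q^{x-1} p = p \sum_{x=1}^{\infty} \frac{d}{dq} q^x = p\,\frac{d}{dq} \sum_{x=1}^{\infty} q^x$$
$$= p\,\frac{d}{dq}\left[q(1 - q)^{-1}\right] = p\left[(1 - q)^{-1} + q(1 - q)^{-2}\right]$$
$$= p\left[\frac{1}{p} + \frac{q}{p^2}\right] = \frac{q + p}{p} = \frac{1}{p}.$$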


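For computing $E(x^2)$, we may observe the following:

$$x^2 q^{x-1} p = p\,\frac{d}{dq}\left[q\,\frac{d}{dq} q^x\right].$$

Hence

$$E(x^2) = p\,\frac{d}{dq}\left[q\,\frac{d}{dq} \sum_{x=1}^{\infty} q^x\right] = p\,\frac{d}{dq}\left[q(1 - q)^{-1} + q^2(1 - q)^{-2}\right]$$
$$= p(1 - q)^{-1} + p\left[3q(1 - q)^{-2} + 2q^2(1 - q)^{-3}\right] = 1 + \frac{3q}{p} + \frac{2q^2}{p^2}. \tag{5.8}$$

We can also obtain this from the moment generating function:

$$M(t) = \sum_{x=1}^{\infty} e^{tx} p q^{x-1} = p\left[e^t + qe^{2t} + q^2 e^{3t} + \cdots\right] = pe^t(1 - qe^t)^{-1} \quad \text{for } qe^t < 1. \tag{5.9}$$

Differentiating $M(t)$ with respect to $t$ and then evaluating at $t = 0$, we have

$$E(x) = \frac{d}{dt}M(t)\Big|_{t=0} = \left\{pe^t(1 - qe^t)^{-1} + pe^t(1 - qe^t)^{-2} qe^t\right\}\Big|_{t=0}$$
$$= p\,p^{-1} + p\,p^{-2} q = 1 + \frac{q}{p} = \frac{1}{p},$$

$$E(x^2) = \frac{d^2}{dt^2}M(t)\Big|_{t=0} = \frac{d}{dt}\left\{pe^t(1 - qe^t)^{-1} + pqe^{2t}(1 - qe^t)^{-2}\right\}\Big|_{t=0}$$
$$= \left\{pe^t(1 - qe^t)^{-1} + pe^t qe^t(1 - qe^t)^{-2} + 2pqe^{2t}(1 - qe^t)^{-2} + 2pq^2 e^{3t}(1 - qe^t)^{-3}\right\}\Big|_{t=0}$$
$$= 1 + \frac{q}{p} + \frac{2q}{p} + \frac{2q^2}{p^2} = 1 + \frac{3q}{p} + \frac{2q^2}{p^2}.$$


5.5 Negative binomial probability law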


Again, let us consider independent Bernoulli trials with the probability of success $p$
remaining the same. Let us ask the question: what is the probability that the $x$-th trial
will result in the $k$-th success for a fixed $k$, something like the 10-th trial resulting in
the 7-th success? Let this probability be denoted by $f_4(x)$. The $k$-th success at the $x$-th
trial means that there were $k - 1$ successes in the first $x - 1$ trials; the successes could
have occurred any time in any sequence but a total of $k - 1$ of them. This is given by a
binomial probability law of $x - 1$ trials and $k - 1$ successes. The next trial should be a
success, then one has the $x$-th trial resulting in the $k$-th success. Hence

$$f_4(x) = \begin{cases} \binom{x-1}{k-1} p^{k-1} q^{(x-1)-(k-1)} \times p = \binom{x-1}{k-1} p^k q^{x-k}, & x = k, k+1, \ldots \\ 0, & \text{elsewhere.} \end{cases}$$




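Note that one has to have at least $k$ trials to get $k$ successes, and hence $x$ varies from
$x = k$ onward. Here, $p$ and $k$ are parameters. What is the total probability here?

$$\sum_x f_4(x) = \sum_{x=k}^{\infty} \binom{x-1}{k-1} p^k q^{x-k} = p^k\left[\binom{k-1}{k-1} + \binom{k}{k-1} q + \binom{k+1}{k-1} q^2 + \cdots\right]$$
$$= p^k\left[1 + k\,\frac{q}{1!} + k(k+1)\,\frac{q^2}{2!} + \cdots\right]$$
$$= p^k(1 - q)^{-k} = p^k p^{-k} = 1$$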


as can be expected. Since $f_4(x)$ is the general term in a binomial expansion with a
negative exponent, this probability function is known as the negative binomial probability
function.

Naturally, when $k = 1$ we have the geometric probability law. Thus the geometric
probability law is a particular case of the negative binomial probability law. Let
us compute $E(x)$, $E(x^2)$ and the moment generating function. The moment generating
function:

$$M(t) = p^k \sum_{x=k}^{\infty} \binom{x-1}{k-1} q^{x-k} e^{tx} = p^k e^{kt} \sum_{x=k}^{\infty} \binom{x-1}{k-1} (qe^t)^{x-k}$$
$$= p^k e^{kt}\left[1 + k\,\frac{qe^t}{1!} + k(k+1)\,\frac{(qe^t)^2}{2!} + \cdots\right]$$
$$= p^k e^{kt}(1 - qe^t)^{-k} \quad \text{for } qe^t < 1. \tag{5.10}$$
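This is a differentiable function. Hence

$$E(x) = \frac{d}{dt}M(t)\Big|_{t=0} = p^k\left\{k e^{kt}(1 - qe^t)^{-k} + k e^{kt}(1 - qe^t)^{-k-1} qe^t\right\}\Big|_{t=0}$$
$$= kp^k\left\{p^{-k} + q p^{-k-1}\right\} = k\left\{1 + \frac{q}{p}\right\} = \frac{k}{p}. \tag{5.11}$$

$$E(x^2) = \frac{d^2}{dt^2}M(t)\Big|_{t=0} = \frac{d}{dt}\left[\frac{d}{dt}M(t)\right]\Big|_{t=0}$$
$$= kp^k\Big\{k e^{kt}(1 - qe^t)^{-k} + kq e^{(k+1)t}(1 - qe^t)^{-(k+1)}$$
$$\qquad + q(k+1) e^{(k+1)t}(1 - qe^t)^{-(k+1)} + (k+1) q^2 e^{(k+2)t}(1 - qe^t)^{-(k+2)}\Big\}\Big|_{t=0}$$
$$= k^2 + \frac{q}{p}(2k^2 + k) + \frac{q^2}{p^2}\,k(k+1). \tag{5.12}$$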

We have given four important probability functions, namely the Bernoulli probabil- 
ity law, the binomial probability law, the geometric probability law and the negative 




binomial probability law, connected with independent Bernoulli trials. All these are 
frequently used in statistical literature and probability theory. 


Example 5.2. In a multiple choice examination, each question is supplied with 3 possible
answers of which one is the correct answer to the question. A student, who does
not know any of the correct answers, is doing the examination by selecting answers at
random. What is the probability that (a) out of the 5 questions answered the student
has (i) exactly 3 correct answers, (ii) at least 3 correct answers; (b) (i) the third question
answered is the first correct answer; (ii) at least 3 questions out of the 10 questions to
be answered are needed to get the first correct answer; (c) (i) the 5th question answered
is the 3rd correct answer; (ii) at least 4 questions, out of the 10 questions answered,
are needed to get the 3rd correct answer?


Solution 5.2. Attempting to answer the questions by selecting answers at random can
be taken as independent Bernoulli trials with probability of success $\frac{1}{3}$ because
out of the 3 possible answers only one is the correct answer to the question. In our
notation, $p = \frac{1}{3}$, $q = \frac{2}{3}$. For (a), it is a binomial situation with $n = 5$.
In (i), we need $\Pr\{x = 3\}$. From the binomial probability law,

$$\Pr\{x = 3\} = \binom{5}{3} p^3 q^{5-3} = \binom{5}{2}\left(\frac{1}{3}\right)^3\left(\frac{2}{3}\right)^2 = \frac{(5)(4)}{(2)(1)} \cdot \frac{4}{3^5} = \frac{40}{243}.$$

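When computing the number of combinations, always use the definition:

$$\binom{n}{r} = \frac{n(n-1) \cdots (n-r+1)}{r!}.$$

It will be foolish to use all factorials because it will involve unnecessary computations
and often big factorials cannot even be handled by computers. In (a)(ii), we need

$$\Pr\{x = 3\} + \Pr\{x = 4\} + \Pr\{x = 5\} = L \text{ say.}$$

Then

$$L = \binom{5}{3}\left(\frac{1}{3}\right)^3\left(\frac{2}{3}\right)^2 + \binom{5}{4}\left(\frac{1}{3}\right)^4\left(\frac{2}{3}\right)^1 + \binom{5}{5}\left(\frac{1}{3}\right)^5\left(\frac{2}{3}\right)^0.$$

But

$$\binom{5}{3}\left(\frac{1}{3}\right)^3\left(\frac{2}{3}\right)^2 = \frac{40}{3^5}, \quad \binom{5}{4}\left(\frac{1}{3}\right)^4\left(\frac{2}{3}\right) = \frac{10}{3^5}, \quad \binom{5}{5}\left(\frac{1}{3}\right)^5 = \frac{1}{3^5}.$$

The total is

$$\frac{51}{3^5} = \frac{17}{81}.$$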
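For (b), it is a geometric probability law. In (i), $x = 3$ and the answer is

$$q^{3-1} p = \left(\frac{2}{3}\right)^2\left(\frac{1}{3}\right) = \frac{4}{27}.$$

For (ii) in (b), we need the sum:

$$\sum_{x=3}^{10} f_3(x) = q^2 p + \cdots + q^9 p = pq^2\left[1 + q + \cdots + q^7\right]$$
$$= pq^2\,\frac{1 - q^8}{1 - q} = q^2(1 - q^8) = \left(\frac{2}{3}\right)^2\left[1 - \left(\frac{2}{3}\right)^8\right].$$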

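For (c), it is a negative binomial situation with $k = 3$. In (i), $x = 5$, $k = 3$,
$p = \frac{1}{3}$, $q = \frac{2}{3}$ and the answer is

$$\binom{x-1}{k-1} p^k q^{x-k} = \binom{4}{2}\left(\frac{1}{3}\right)^3\left(\frac{2}{3}\right)^2 = \frac{8}{81}.$$

In (ii), we need the sum:

$$\sum_{x=4}^{10} \binom{x-1}{2} p^3 q^{x-3} = \left(\frac{1}{3}\right)^3\left[\binom{3}{2}\left(\frac{2}{3}\right) + \binom{4}{2}\left(\frac{2}{3}\right)^2 + \cdots + \binom{9}{2}\left(\frac{2}{3}\right)^7\right].$$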

Example 5.3. An experiment on rabbits is designed by taking N = 20 identical rabbits. 
But rabbits start dying out before the experiment is completed. Let n be the effective 
final number. This sample size n has become a random quantity. n could be zero (all 
rabbits died out), n could be 1,2,... and could be N = 20 (no rabbit died). Let 0.1 be
the probability of a rabbit dying and suppose that this probability is the same for all 
rabbits. Construct the probability law for n. 


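Solution 5.3. Each rabbit survives with probability $1 - 0.1 = 0.9$, so the number $n$ of
surviving rabbits satisfies all the conditions for a binomial random variable with
probability of success (survival) $p = 0.9$ and the number of trials $N = 20$. Hence the
probability law for $n$, denoted by $P_n$, is

$$P_n = \binom{N}{n} p^n (1 - p)^{N-n} = \binom{20}{n} (0.9)^n (0.1)^{20-n}, \quad n = 0, 1, \ldots, 20.$$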


Example 5.4. A lunch counter in an office building caters to people working in the
building. Apart from regular lunches, the counter operator makes an exotic lunch
packet every day. If the exotic packet is not sold on that day, then it is a total loss;
it cannot be stored or used again. A packet costs Rs 5 to make and she can sell it for
Rs 10. From past experience, the operator found the daily demand for this exotic packet
to be as follows:

(Demand, probability) = (0, 0.1), (1, 0.2), (2, 0.2), (3, 0.3), (4, 0.1), (5, 0.1).

There is no demand for more than 5. The operator can sell, for example, the 3rd packet
only if the demand is for 3 or more. How many packets should she make so that
her expected profit is a maximum?


Solution 5.4. If she makes 1 packet, the probability that it can be sold is the probability
that the demand on that day is for 1 or more packets. The probability for this is
0.2 + 0.2 + 0.3 + 0.1 + 0.1 = 0.9. It costs Rs 5 and the expected revenue is
Rs 10 x 0.9 = Rs 9, and the expected profit is Rs 4.

[As an expected value of a random variable, this is the following: Let y be her gain
or loss on a single packet. Then y takes the value +5 (profit) with probability 0.9 [if the
demand on that day is for one or more] and y takes the value -5 (loss) with probability
0.1 [if the demand on that day is zero]. Then the expected value of
y is E(y) = 5(0.9) - 5(0.1) = 4. Thus she has an expected profit of Rs 4.]

If she makes 2 packets, then the cost is 2 x 5 = 10. She can sell the first packet with
probability 0.9, with expected revenue Rs 9. She can sell the second packet
if there is demand for 2 or more, that is, with probability 0.2 + 0.3 + 0.1 + 0.1 = 0.7, and
the expected revenue from it is 10 x 0.7 = 7. Thus the total expected revenue is 9 + 7 = 16
and the expected profit is 16 - 10 = 6.

If she makes 3 packets, then she can sell the third packet with probability 0.3 +
0.1 + 0.1 = 0.5 and the expected revenue from it is 10 x 0.5 = 5. Thus the total expected
revenue is 9 + 7 + 5 = 21, the cost is 3 x 5 = 15 and the expected profit is 21 - 15 = 6.

If she makes 4 packets, then she can sell the 4th packet with probability 0.1 + 0.1 =
0.2 and the expected revenue from it is 10 x 0.2 = 2. The total expected revenue is
9 + 7 + 5 + 2 = 23, the cost is 4 x 5 = 20 and the expected profit is 23 - 20 = 3.

If she makes 5 packets, then she can sell the 5th one with probability 0.1 and the
expected revenue from it is Rs 1. The total expected revenue is 24, the total cost is
5 x 5 = 25 and, therefore, there is an expected loss of Rs 1. Hence she should make
either 2 or 3 packets to maximize her expected profit.
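The packet-by-packet reasoning above is equivalent to computing the expected number of packets sold, $E[\min(m, D)]$, where $m$ is the number made and $D$ the demand. A minimal sketch of that calculation follows; the names `demand_pmf`, `expected_profit`, `COST` and `PRICE` are illustrative assumptions.

```python
from fractions import Fraction

# Daily demand distribution from Example 5.4 (demand -> probability).
demand_pmf = {0: Fraction(1, 10), 1: Fraction(2, 10), 2: Fraction(2, 10),
              3: Fraction(3, 10), 4: Fraction(1, 10), 5: Fraction(1, 10)}
COST, PRICE = 5, 10  # rupees per packet


def expected_profit(made):
    """Expected profit when `made` packets are prepared: she sells min(made, demand)."""
    expected_sales = sum(min(made, d) * pr for d, pr in demand_pmf.items())
    return PRICE * expected_sales - COST * made


profits = {m: expected_profit(m) for m in range(0, 6)}
best = max(profits.values())
best_choices = [m for m, value in profits.items() if value == best]
print(profits, best_choices)
```

Both formulations give the same expected profits, since the $j$-th packet is sold exactly when the demand is at least $j$.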


5.6 Poisson probability law 


We will derive this probability law as a limiting form of the binomial as well as a
process satisfying some conditions. This law is named after its inventor, S. Poisson,
a French mathematician. Consider a binomial probability law where the number of
trials $n$ is very large and the probability of success $p$ is very small but $np = \lambda$
(Greek letter lambda), a finite quantity. This situation can be called a situation of rare
events: the number of trials is very large and the probability of success in each trial is
very small, something like a lightning strike, an earthquake at a particular place, traffic
accidents on a certain stretch of a highway and so on. Let us see what happens if
$n \to \infty$, $p \to 0$, $np = \lambda$. Since we are assuming $\lambda$ to be a finite quantity,
we may replace one of $n$ or $p$ in terms of $\lambda$. Let us substitute $p = \frac{\lambda}{n}$. Then

$$\lim_{n \to \infty} \binom{n}{x} p^x = \frac{1}{x!} \lim_{n \to \infty} n(n-1) \cdots (n-x+1)\left(\frac{\lambda}{n}\right)^x$$
$$= \frac{\lambda^x}{x!} \lim_{n \to \infty} \frac{n}{n} \cdot \frac{(n-1)}{n} \cdots \frac{(n-x+1)}{n}$$
$$= \frac{\lambda^x}{x!}\left[1 \times \lim_{n \to \infty}\left(1 - \frac{1}{n}\right) \times \cdots \times \lim_{n \to \infty}\left(1 - \frac{x-1}{n}\right)\right]$$
$$= \frac{\lambda^x}{x!}$$

since x is finite, there are only a finite number of factors, and we can use the formula 
that the limit of a finite number of products is the product of the limits and each factor 
here goes to 1. [If it involved an infinite number of factors, then we could not have 
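taken the limits on each factor.] Now let us examine the factor:

$$q^{n-x} = (1 - p)^{n-x} = \left(1 - \frac{\lambda}{n}\right)^n \left(1 - \frac{\lambda}{n}\right)^{-x}.$$

But

$$\lim_{n \to \infty}\left(1 - \frac{\lambda}{n}\right)^n = e^{-\lambda}$$

from the definition of $e$, and

$$\lim_{n \to \infty}\left(1 - \frac{\lambda}{n}\right)^{-x} = 1$$

since the exponent $-x$ is finite. Therefore,

$$\lim_{n \to \infty,\ p \to 0,\ np = \lambda} f_2(x) = \lim_{n \to \infty,\ p \to 0,\ np = \lambda} \binom{n}{x} p^x q^{n-x} = \begin{cases} \frac{\lambda^x}{x!} e^{-\lambda}, & x = 0, 1, \ldots,\ \lambda > 0 \\ 0, & \text{elsewhere.} \end{cases}$$

Let us call the right side $f_5(x)$. Let us see whether $f_5(x)$ is a probability function,
since we have done a limiting process on a binomial probability function. If you take
some sort of limits on a probability function, the limiting form need not remain a
probability function. The total of $f_5(x)$ is given by

$$\sum_x f_5(x) = \sum_{x=0}^{\infty} \frac{\lambda^x}{x!} e^{-\lambda} = e^{-\lambda} e^{\lambda} = 1.$$

Hence $f_5(x)$ is a probability function and it is known as the Poisson probability law.
$\lambda$ is a parameter here.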
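The limiting argument can be illustrated numerically: with $p = \lambda/n$, the binomial probabilities approach the Poisson probabilities as $n$ grows. The sketch below uses the illustrative value $\lambda = 2$ and compares the two laws on $x = 0, 1, \ldots, 10$; the function names and the chosen values of $n$ are assumptions.

```python
from math import comb, exp, factorial


def binomial_pmf(x, n, p):
    """Binomial probability of x successes in n trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def poisson_pmf(x, lam):
    """Poisson probability lam^x e^(-lam) / x!."""
    return lam**x * exp(-lam) / factorial(x)


lam = 2.0  # illustrative value of lambda (an assumption)


def max_gap(n):
    """Largest absolute difference between the two laws over x = 0, ..., 10."""
    return max(abs(binomial_pmf(x, n, lam / n) - poisson_pmf(x, lam)) for x in range(11))


gaps = [max_gap(n) for n in (10, 100, 1000, 10000)]
print(gaps)
```

The gaps shrink roughly in proportion to $1/n$, which is consistent with treating the Poisson law as an approximation for rare events.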

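Let us evaluate $E(x)$, variance and the moment generating function for the Poisson
probability law:

$$E(x) = \sum_{x=0}^{\infty} x\,\frac{\lambda^x}{x!} e^{-\lambda} = e^{-\lambda} \sum_{x=1}^{\infty} x\,\frac{\lambda^x}{x!}$$

since at $x = 0$ the right side is zero. Now we take out one lambda and cancel one $x$,
$\frac{x}{x!} = \frac{1}{(x-1)!}$ when $x \ne 0$. Then we have

$$E(x) = \lambda e^{-\lambda} \sum_{x=1}^{\infty} \frac{\lambda^{x-1}}{(x-1)!} = \lambda e^{-\lambda}\left[1 + \frac{\lambda}{1!} + \frac{\lambda^2}{2!} + \cdots\right] = \lambda e^{-\lambda} e^{\lambda} = \lambda. \tag{5.13}$$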

Thus, the mean value in the Poisson case is the parameter $\lambda$ sitting in the probability
function. Since $x!$ is sitting in the denominator, for computing $E(x^2)$, we will go
through the factorial moments or consider the identity

$$x^2 = x(x - 1) + x$$

and proceed to evaluate $E[x(x - 1)]$. This procedure has already been done several
times before. We cancel $x(x - 1)$ from $x!$; since at $x = 0, 1$ the terms on the right are
zero, $x$ only goes from $x = 2$ to infinity in the sum. Then we take out $\lambda^2$.
That is,

$$E[x(x-1)] = \sum_{x=0}^{\infty} x(x-1)\,\frac{\lambda^x}{x!} e^{-\lambda} = \lambda^2 e^{-\lambda} \sum_{x=2}^{\infty} \frac{\lambda^{x-2}}{(x-2)!} = \lambda^2 e^{-\lambda} e^{\lambda} = \lambda^2.$$

Then the variance

$$\sigma^2 = E[x - E(x)]^2 = E[x(x-1)] + E[x] - [E(x)]^2 = \lambda^2 + \lambda - [\lambda]^2 = \lambda. \tag{5.14}$$

Thus it is interesting to see that the mean value and the variance are both equal to
$\lambda$ in the Poisson case. But this is not a unique property or a characterizing property
of the Poisson distribution. There are also other distributions for which
the mean value is equal to the variance.

Let us compute the moment generating function in this case:

$$M(t) = E[e^{tx}] = \sum_{x=0}^{\infty} e^{tx}\,\frac{\lambda^x}{x!} e^{-\lambda} = e^{-\lambda} \sum_{x=0}^{\infty} \frac{(\lambda e^t)^x}{x!} = e^{-\lambda} e^{\lambda e^t} = e^{\lambda(e^t - 1)}. \tag{5.15}$$

The last sum is obtained by observing that it is an exponential series with $\lambda e^t$ in the
exponent.


Example 5.5. The monthly number of traffic accidents on a stretch of a particular highway
is seen to be Poisson distributed with expected number of accidents 5. Four months are
selected at random. What is the probability that (i) in all four months the number of
accidents is two or more per month; (ii) in at least one of the months the number of
accidents is exactly 3; (iii) the first three months had no accidents and the fourth month
had two accidents?


Solution 5.5. Let $x$ be the number of monthly accidents on that stretch of the highway.
Then the probability function, denoted by $P(x)$, is given as

$$P(x) = \begin{cases} \frac{5^x}{x!} e^{-5}, & x = 0, 1, \ldots \\ 0, & \text{elsewhere.} \end{cases}$$

The parameter $\lambda = 5$ because it is given that the mean value is 5. The probability that
the number of accidents in a month is 2 or more, denoted by $p_1$, is given by

$$p_1 = \sum_{x=2}^{\infty} \frac{5^x}{x!} e^{-5} = 1 - \sum_{x=0}^{1} \frac{5^x}{x!} e^{-5}$$

since the total probability is 1. That is,

$$p_1 = 1 - \left[\frac{5^0}{0!} + \frac{5^1}{1!}\right] e^{-5} = 1 - e^{-5}[1 + 5] = 1 - 6e^{-5}.$$

The answer to (i) is then $p_1^4$.

The probability that the number of accidents in a month is exactly 3, denoted by $p_2$,
is given by

$$p_2 = \frac{5^3}{3!} e^{-5} = \frac{125}{6} e^{-5}.$$

In (ii), it is a binomial situation with the number of trials $n = 4$ and probability of
success $p_2$. Hence the answer to (ii) is

$$\sum_{x=1}^{4} \binom{4}{x} p_2^x (1 - p_2)^{4-x} = 1 - \binom{4}{0} p_2^0 (1 - p_2)^4 = 1 - \left(1 - \frac{125}{6} e^{-5}\right)^4.$$

Probability of having no accidents is

$$\frac{5^0}{0!} e^{-5} = e^{-5}.$$

Probability of having exactly 2 accidents is $\frac{5^2}{2!} e^{-5} = \frac{25}{2} e^{-5}$. Hence
the answer to (iii) is

$$\left[e^{-5}\right]^3\left[\frac{25}{2} e^{-5}\right] = \frac{25}{2} e^{-20}.$$
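For readers who want decimal values, the three answers of Solution 5.5 can be evaluated as follows. This is only an arithmetic sketch; the names `poisson_pmf`, `answer_i`, `answer_ii` and `answer_iii` are hypothetical.

```python
from math import exp, factorial, isclose

LAM = 5.0  # expected number of accidents per month (given in Example 5.5)


def poisson_pmf(x, lam=LAM):
    """Poisson probability lam^x e^(-lam) / x!."""
    return lam**x * exp(-lam) / factorial(x)


p1 = 1 - poisson_pmf(0) - poisson_pmf(1)  # two or more accidents in a month
answer_i = p1**4                          # (i) two or more in all four months
p2 = poisson_pmf(3)                       # exactly three accidents in a month
answer_ii = 1 - (1 - p2) ** 4             # (ii) exactly three in at least one month
answer_iii = poisson_pmf(0) ** 3 * poisson_pmf(2)  # (iii) 0, 0, 0 and then 2 accidents
print(answer_i, answer_ii, answer_iii)
```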


5.6.1 Poisson probability law from a process 


Consider an event taking place over time, such as the arrival of persons into a queue at
a checkout counter, arrival of cars into a service station for service, arrival of telephone
calls into a phone switchboard, floods in a river during rainy season, earthquakes over
the years at a particular place, eruption of a certain volcano over time and so on. Let
us assume that our event satisfies the following conditions:

(i) The probability of occurrence of the event from time $t$ to $t + \Delta t$, that is, in the
interval $[t, t + \Delta t]$, where $\Delta t$ is a small increment in $t$, is proportional to the
length of the interval, or it is $\alpha \Delta t$, where $\alpha$ is a constant. Here, $\Delta$
(Greek capital letter delta) is not used as a product; $\Delta t$ is a notation standing for a
small increment in $t$.

(ii) The probability of more than one occurrence of this event in $[t, t + \Delta t]$ is
negligibly small and we take it as zero for all practical purposes, or it is assumed that the
interval can be subdivided to the extent that the probability of more than one occurrence
in this small interval is negligible.

(iii) The occurrence or non-occurrence of this event in $[t, t + \Delta t]$ does not depend
upon what happened in the interval $[0, t]$ where 0 indicates the start of the observation
period. An illustration is given in Figure 5.2. If the event under observation is a flood
in a river during the rainy season, then the start of the rainy season is taken as zero and
time may be counted in days or hours or in any convenient unit.


Figure 5.2: An event taking place over time.


Under the conditions (i), (ii), (iii) above, what is the probability of getting exactly $x$
occurrences of this event in time $[0, t]$? This probability function depends upon $x$ and
the time $t$ and let us denote it by $f(x, t)$. Then the interpretations are the following:

$f(x, t + \Delta t)$ = the probability of getting exactly $x$ occurrences of the event in
time $[0, t + \Delta t]$;

$f(x - 1, t)$ = the probability of getting exactly $x - 1$ occurrences in time $[0, t]$.

Exactly $x$ occurrences in the interval 0 to $t + \Delta t$ can happen in two mutually
exclusive ways: (a) exactly $x - 1$ occurrences in the interval $[0, t]$ and one occurrence
in the interval $[t, t + \Delta t]$ [probability for one occurrence is $\alpha \Delta t$], or
(b) exactly $x$ occurrences in the interval $[0, t]$ and no occurrence in the
interval $[t, t + \Delta t]$ [probability for no occurrence is $1 - \alpha \Delta t$]. Therefore,
from the total probability law,

$$f(x, t + \Delta t) = f(x - 1, t)[\alpha \Delta t] + f(x, t)[1 - \alpha \Delta t].$$

We can rearrange the terms and write

$$\frac{f(x, t + \Delta t) - f(x, t)}{\Delta t} = \alpha[f(x - 1, t) - f(x, t)].$$

Taking the limit when $\Delta t \to 0$, we get a differential equation in $t$, or a partial
differential equation in $t$. That is,

$$\frac{\partial}{\partial t} f(x, t) = \alpha[f(x - 1, t) - f(x, t)]. \tag{5.16}$$

Here, (5.16) is a differential equation in $t$ whereas it is a difference equation in $x$. We
have to solve this difference-differential equation to obtain $f(x, t)$. This can be solved
successively by taking $x = 0$ and solving the equation for $t$, then $x = 1$ and solving the
equation for $t$, and so on. The final result will be the following:

$$f(x, t) = \begin{cases} \frac{(\alpha t)^x}{x!} e^{-\alpha t}, & \alpha > 0,\ 0 \le t < \infty,\ x = 0, 1, \ldots \\ 0, & \text{elsewhere,} \end{cases}$$

or it is a Poisson probability law with the parameter $\lambda = \alpha t$.
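One way to gain confidence in the stated solution, without carrying out the successive integration, is to check numerically that it satisfies (5.16). In the sketch below the time derivative is replaced by a central difference; the function names, the step size `h` and the values $\alpha = 0.5$, $t = 3$ are illustrative assumptions.

```python
from math import exp, factorial


def poisson_process_pmf(x, t, alpha):
    """f(x, t) = (alpha t)^x e^(-alpha t) / x! for x = 0, 1, ... and zero elsewhere."""
    if x < 0:
        return 0.0
    return (alpha * t) ** x * exp(-alpha * t) / factorial(x)


def residual(x, t, alpha, h=1e-5):
    """Left side minus right side of (5.16), with d/dt replaced by a central difference."""
    dfdt = (poisson_process_pmf(x, t + h, alpha) - poisson_process_pmf(x, t - h, alpha)) / (2 * h)
    rhs = alpha * (poisson_process_pmf(x - 1, t, alpha) - poisson_process_pmf(x, t, alpha))
    return dfdt - rhs


alpha, t = 0.5, 3.0  # illustrative rate and time (assumptions)
residuals = [residual(x, t, alpha) for x in range(6)]
print(residuals)
```

The residuals should be at the level of the finite-difference error only; for $x = 0$ the term $f(-1, t)$ is taken as zero, in line with the "zero elsewhere" convention.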


Example 5.6. Telephone calls are coming to an office switchboard at the rate of 0.5
calls per minute, time being measured in minutes. What is the probability that (a) in
a 10-minute interval (i) there are exactly 2 calls; (ii) there is no call; (iii) there is at
least one call; (b) if two 10-minute intervals are taken at random then (i) in both intervals
there are no calls; (ii) in one of the intervals there are exactly 2 calls?


Solution 5.6. We will assume that these telephone calls obey the conditions for Poisson
arrivals of calls, that is, that the Poisson probability law is a good model. We are given
$\alpha = 0.5$. In (a) $t = 10$, then $\lambda = 10 \times 0.5 = 5$ and the probability law $P(x)$ is

$$P(x) = \begin{cases} \frac{5^x}{x!} e^{-5}, & x = 0, 1, \ldots \\ 0, & \text{elsewhere.} \end{cases}$$

In (a)(i), we need the probability $\Pr\{x = 2\}$:

$$\Pr\{x = 2\} = \frac{5^2}{2!} e^{-5} = \frac{25}{2} e^{-5}.$$

In (a)(ii), we need the probability $\Pr\{x = 0\}$:

$$\Pr\{x = 0\} = \frac{5^0}{0!} e^{-5} = e^{-5}.$$

In (a)(iii), we need $\Pr\{x \ge 1\}$:

$$\Pr\{x \ge 1\} = 1 - \Pr\{x = 0\} = 1 - e^{-5}.$$

In (b)(i), it is a case of two Bernoulli trials where the probability of success $p_1$ is the
probability of having no arrivals in a 10-minute interval, or $p_1 = e^{-5}$. We want both
trials to result in successes, and hence the answer is

$$p_1^2 = \left[e^{-5}\right]^2 = e^{-10}.$$

In (b)(ii), we have two Bernoulli trials and the probability of success in each trial is $p_2$
where $p_2$ is the probability of having exactly 2 calls in a 10-minute interval. Then

$$p_2 = \frac{5^2}{2!} e^{-5} = \frac{25}{2} e^{-5}.$$

In (b)(ii), we need the probability of getting exactly one success in two Bernoulli trials.
Then it is given by

$$\binom{2}{1} p_2 (1 - p_2) = 2\left[\frac{25}{2} e^{-5}\right]\left[1 - \frac{25}{2} e^{-5}\right].$$
Another probability law which is frequently used in probability and statistics is the 
discrete hypergeometric law. 


5.7 Discrete hypergeometric probability law 


Let us consider a box containing two types of objects: one type is $a$ in number, which
we will call $a$-type objects, and the other type is $b$ in number, which we will call
$b$-type objects. As an example, we can consider a box containing red and green
marbles; if there are 10 green and 8 red marbles then we may consider
$a = 10$ and $b = 8$ or vice versa. Suppose that a subset of $n$ items is taken at random
from this set of $a + b$ objects. When we say "at random" it means that every subset of $n$
has the same chance of being taken, or each subset gets a probability of
$1/\binom{a+b}{n}$ because there are $\binom{a+b}{n}$ such subsets possible. This can also be
done by taking the items one by one, at random, without replacement. Both will lead to the
same probabilities.

In this experiment, what is the probability that the sample of $n$ items contains $x$
of $a$-type and $n - x$ of $b$-type objects? Let this probability be denoted by $f_6(x)$. Note
that $x$ of $a$-type can only come from the $a$-type objects and this can be done in
$\binom{a}{x}$ ways. For each such selection of $a$-type objects, we can select $n - x$
$b$-type objects in $\binom{b}{n-x}$ ways. Therefore, the number of choices favorable to the
event is $\binom{a}{x}\binom{b}{n-x}$. Hence

$$f_6(x) = \begin{cases} \dfrac{\binom{a}{x}\binom{b}{n-x}}{\binom{a+b}{n}}, & x = 0, 1, \ldots, n \text{ or } a;\ a, b = 1, 2, \ldots;\ n = 1, 2, \ldots \\ 0, & \text{elsewhere.} \end{cases}$$

This is known as the discrete hypergeometric probability law. $a, b, n$ are parameters
here.
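First, let us check to see the sum:

$$\sum_x f_6(x) = \sum_x \frac{\binom{a}{x}\binom{b}{n-x}}{\binom{a+b}{n}} = 1$$

because, from Section 3.3 we have

$$\sum_x \binom{a}{x}\binom{b}{n-x} = \binom{a+b}{n}. \tag{5.17}$$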


Thus the total probability is 1 as can be expected. What are the mean value and variance
in this case?

$$E(x) = \frac{1}{\binom{a+b}{n}} \sum_{x=0}^{n} x \binom{a}{x}\binom{b}{n-x}.$$

When $x = 0$, the right side is zero, and hence the sum starts only at $x = 1$. Then one
may cancel one $x$ from the $x!$. That is,

$$x \binom{a}{x} = x\,\frac{a!}{x!(a-x)!} = a\,\frac{(a-1)!}{(x-1)!((a-1)-(x-1))!} = a \binom{a-1}{x-1}.$$

Hence, taking the sum by putting $y = x - 1$ so that $y$ goes from 0, and by using (5.17),
we have

$$a \sum_{y=0}^{n-1} \binom{a-1}{y}\binom{b}{n-1-y} = a \binom{a+b-1}{n-1}.$$

Now, dividing by $\binom{a+b}{n}$ and simplifying we get

$$E(x) = \frac{na}{a+b}. \tag{5.18}$$

By using the same steps, the second factorial moment is given by

$$E[x(x-1)] = \frac{n(n-1)a(a-1)}{(a+b)(a+b-1)}. \tag{5.19}$$

Now, the variance is available from the formula

$$\mathrm{Var}(x) = E[x(x-1)] + E(x) - [E(x)]^2 = \frac{n(n-1)a(a-1)}{(a+b)(a+b-1)} + \frac{na}{(a+b)} - \frac{n^2 a^2}{(a+b)^2}. \tag{5.20}$$



Example 5.7. From a set of 5 women and 8 men a committee of 3 is selected at random 
[this means all such subsets of 3 are given equal chances of being selected]. What is 
the probability that the committee consists of (i) no woman; (ii) at least two women; 
(iii) all women? 


Solution 5.7. Let x be the number of women in the committee. Then x is distributed according to a discrete hypergeometric probability law. In (i), we need Pr{x = 0}:

$$\Pr\{x=0\}=\frac{\binom{5}{0}\binom{8}{3}}{\binom{13}{3}}=\frac{(8)(7)(6)}{(13)(12)(11)}=\frac{28}{143}.$$

In (ii), we need Pr{x = 2 or 3} = Pr{x = 2} + Pr{x = 3}. In (iii), we need Pr{x = 3}. Let us compute these two probabilities:

$$\Pr\{x=3\}=\frac{\binom{5}{3}\binom{8}{0}}{\binom{13}{3}}=\frac{(5)(4)(3)}{(13)(12)(11)}=\frac{5}{143},$$

$$\Pr\{x=2\}=\frac{\binom{5}{2}\binom{8}{1}}{\binom{13}{3}}=\frac{(10)(8)}{286}=\frac{40}{143}.$$

Hence the answer in (ii) is

$$\frac{40}{143}+\frac{5}{143}=\frac{45}{143}.$$
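The three answers of Example 5.7 can be reproduced with a few lines of Python. This is only a sketch; the function name `committee_prob` and its default arguments are hypothetical labels chosen for this illustration.

```python
from fractions import Fraction
from math import comb

def committee_prob(women_chosen, women=5, men=8, size=3):
    """Probability that a random committee of the given size
    contains exactly `women_chosen` women (hypergeometric law)."""
    favorable = comb(women, women_chosen) * comb(men, size - women_chosen)
    return Fraction(favorable, comb(women + men, size))

p_no_woman = committee_prob(0)                           # part (i)
p_at_least_two = committee_prob(2) + committee_prob(3)   # part (ii)
p_all_women = committee_prob(3)                          # part (iii)

print(p_no_woman, p_at_least_two, p_all_women)
```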


5.8 Other commonly used discrete distributions 


Here, we list some other commonly used discrete distributions. Only the non-zero part 
of the probability function is given and it should be understood that the function is 
zero otherwise. In some of the probability functions, gamma functions, Γ(·) (Γ is the
Greek capital letter gamma) and beta functions, B(·,·) (B is the Greek capital letter beta)
appear. Hence we will list the integral representations of these functions here. Their 
definitions will be given in the next chapter. Only these integral representations will 
be sufficient to do problems on the following probability functions where gamma and 
beta functions appear. 
The integral representation for a gamma function, Γ(α), is the following: 


$$\Gamma(\alpha)=\int_0^{\infty}x^{\alpha-1}e^{-x}\,dx,\quad \Re(\alpha)>0. \tag{5.21}$$

For the integral to converge, we need the condition that the real part of α (alpha) is positive; $\Re(\cdot)$ means the real part of (·).


Note 5.1. Usually in statistical problems the parameters are real but the integrals will 
exist in the complex domain also, and hence the conditions are written in terms of real 
parts of the complex parameters. 


The beta function, B(α, β), can be written in terms of the gamma function. In the following, we give the connection to the gamma function and integral representations for beta functions:

$$B(\alpha,\beta)=\frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}=\int_0^1x^{\alpha-1}(1-x)^{\beta-1}\,dx \tag{5.22}$$

$$\phantom{B(\alpha,\beta)}=\int_0^{\infty}y^{\alpha-1}(1+y)^{-(\alpha+\beta)}\,dy,\quad \Re(\alpha)>0,\ \Re(\beta)>0. \tag{5.23}$$

For the convergence of the integrals in (5.22) and (5.23), we need the conditions $\Re(\alpha)>0$ and $\Re(\beta)>0$ (β is the Greek small letter beta). It may be noted that

$$B(\alpha,\beta)=B(\beta,\alpha). \tag{5.24}$$


That is, the parameters α and β can be interchanged in the integrals.

$$f_7(x)=\binom{n}{x}\frac{\Gamma(\alpha+\beta)\Gamma(x+\alpha)\Gamma(n+\beta-x)}{\Gamma(\alpha)\Gamma(\beta)\Gamma(n+\alpha+\beta)}$$
for x = 0,1,...,n; α > 0, β > 0 [Beta-binomial probability function].

$$f_8(x)=\frac{\Gamma(r+s)\Gamma(x+n-r-s)\Gamma(x)\Gamma(n)}{\Gamma(r)\Gamma(s)\Gamma(x-r+1)\Gamma(n-s)\Gamma(x+n)}$$
for x = r, r+1, ...; s > 0, n > s; r a positive integer [Beta-Pascal probability function].

$$f_9(x)=\sum_{i=1}^{k}w_i\binom{n}{x}p_i^x(1-p_i)^{n-x}$$
for x = 0,1,...,n; 0 < p_i < 1, w_i ≥ 0, i = 1,...,k; $\sum_{i=1}^{k}w_i=1$ [Mixed binomial probability function].

$$f_{10}(x)=\frac{\binom{n}{x}p^x(1-p)^{n-x}}{1-(1-p)^n}$$
for x = 1,2,...,n; 0 < p < 1; (truncated below x = 1) [Truncated binomial probability function].


$$f_{11}(x)=\frac{x^{x-2}\alpha^{x-1}e^{-\alpha x}}{(x-1)!},\quad x=1,2,\dots;\ \alpha>0$$
[Borel probability law].


$$f_{12}(x)=\frac{r}{(x-r)!}\,x^{x-r-1}\alpha^{x-r}e^{-\alpha x}$$
for x = r, r+1, ...; α > 0, where r is a positive integer [Borel-Tanner probability law].


$$f_{13}(x)=\frac{\Gamma(\nu+x)\Gamma(\lambda)}{\Gamma(\lambda+x)\Gamma(\nu)}\,\frac{\mu^x}{{}_1F_1(\nu;\lambda;\mu)}$$
for x = 0,1,2,...; ν > 0, λ > 0, μ > 0 where ${}_1F_1$ is a confluent hypergeometric function
(ν is Greek letter nu; μ is Greek letter mu) [Confluent hypergeometric probability law].




N 
feo -Wi o ría ΞΡΙ ΣΧ ds νν»φίχ) 


for x = 0,1, ..., N; w, =1-w,,0<w,<1,0<p, «1and φίχ) are some probability func- 
tions [Dodge-Romig probability law]. 


πο (DT 


for x =0,1,...,c;0 < p <1; N,c positive integers [Engset probability law]. 


$$f_{16}(x)=\frac{a(x)\theta^x}{\sum_{x\in A}a(x)\theta^x},\quad \theta>0$$
for a(x) > 0, x ∈ A = subset of reals [Generalized power series probability function].


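$$f_{17}(x)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\frac{\Gamma(\alpha+x-1)\Gamma(\beta+1)}{\Gamma(\alpha+\beta+x)}$$
for x = 1,2,...; α > 0, β > 0 [Compound geometric probability law].

$$f_{18}(x)=e^{-\lambda}\sum_{m=0}^{\infty}\frac{\lambda^m}{m!}\binom{2m}{x}p^x(1-p)^{2m-x}$$
for x = 0,1,2,...,2m; λ > 0, 0 < p < 1 [Hermite probability law].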


$$f_{19}(x)=\frac{\Gamma(\lambda)\theta^x}{\Gamma(\lambda+x)\,{}_1F_1(1;\lambda;\theta)}$$
for x = 0,1,2,...; θ > 0, λ > 0 [Hyper-Poisson probability function].
$$f_{20}(x)=\frac{\theta^x}{\beta x},\quad 0<\theta<1,$$
for x = 1,2,...,d where $\beta=\sum_{x=1}^{d}\frac{\theta^x}{x}$ [Truncated logarithmic series probability function].


$$f_{21}(x)=\sum_{i=1}^{k}w_if_i(x),\quad 0<w_i<1,\quad \sum_{i=1}^{k}w_i=1$$
where $f_i(x)$ is a general probability or density function for each i = 1,...,k [General
mixed probability function].


$$f_{22}(x)=\frac{\Gamma(r+x)}{x!\,\Gamma(r)}\,p^r(1-p)^x$$
for x = 0,1,...; 0 < p < 1, r > 0 [Negative binomial probability function, model-2].


a 1)Ρ-αΓ 
α (a 4 1) ALUKA 


fs) = rex (a Pe 


1 
App I) 


for x 20,1,..5a»0,p»0,a» 0 and ;F, is a Gauss’ hypergeometric function [Gener- 
alized negative binomial probability function]. 




fax) = Ce? Pee k - P 


for x = 0,1, ...; À,c positive constants [Neyman type A probability function]. 


$$f_{25}(x)=\binom{x+r-1}{x}p^r(1-p)^x$$
for x = 0,1,...; 0 < p < 1; r a positive integer [Pascal probability law].


$$f_{26}(x)=e^{-\alpha}\sum_{m=0}^{\infty}\frac{\alpha^m}{m!}\binom{nm}{x}p^x(1-p)^{nm-x}$$
for x = 0,1,...,nm; α > 0, 0 < p < 1; n, m positive integers [Poisson-binomial probability
function].


$$f_{27}(x)=\frac{\mu^x}{x!\,[\exp(\mu)-1]}$$
for x = 1,2,...; μ > 0 (truncated below x = 1) [Truncated Poisson probability function].


$$f_{28}(x)=\binom{N}{x}\frac{\Gamma(\alpha+\beta)\Gamma(\alpha+x)\Gamma(\beta+N-x)}{\Gamma(\alpha)\Gamma(\beta)\Gamma(\alpha+\beta+N)}$$
for x = 0,1,...,N; α > 0, β > 0 [Polya probability law or Beta-binomial probability function].


fo - tB Gu 


for x = 0,1,...; d > 0, h > 0 [Polya-Eggenberger probability function].
$$f_{30}(x)=\frac{1}{n},\quad x=x_1,\dots,x_n$$
[Discrete uniform probability function].


[o (x)- exp[ ας ας φγ Kerc θγ]ΐ 


for x =0,1,...;@ > 0, 8 > 0, À > O [Short's probability law]. 


$$f_{32}(x)=\frac{1}{[1-P(0)]}\sum_{t=x+1}^{\infty}\frac{P(t)}{t}$$
for x = 0,1,...; where P(t) is any discrete probability function over the range t =
0,1,2,... [Ster's probability function].

$$f_{33}(x)=[\zeta(k)\,x^k]^{-1}$$
for x = 1,2,... where $\zeta(k)=\sum_{t=1}^{\infty}t^{-k}$, k > 1 (ζ is the Greek letter zeta) [Zeta probability
function].


Exercises 5 


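5.1. Compute $E(x^2)$ for the geometric probability law by summing up or by using the
definition, that is, by evaluating $\sum_x x^2f(x)$, where f(x) is the geometric probability function and the sum is over its support.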


5.2. Compute (i) $E(x)$; (ii) $E(x^2)$; for the negative binomial probability law by using the
definition (by summing up).


5.3. Compute (i) $E(x)$; (ii) $E(x^2)$; by using the technique used in the geometric probability law by differentiating the negative binomial probability law.


5.4. Compute $E(x)$ and $E(x^2)$ by differentiating the moment generating function in the
Poisson probability case.


5.5. Compute E(x) and variance of x by using the moment generating function in the 
binomial probability law. 


5.6. Construct two examples of discrete probability functions where E(x) = Var(x). 


5.7. Solve the difference-differential equation in (5.16) and show that the solution is 
the probability function given therein. 


5.8. Show that the functions $f_7(x)$ to $f_{33}(x)$ in Section 5.8 are all probability functions,
that is, the functions are non-negative and the sum in each case is 1.


5.9. For the probability functions in Exercise 5.8, evaluate the first two moments about
the origin, that is, $E(x)$ and $E(x^2)$, whenever they exist.


Truncation. In some practical problems, the general behavior of the discrete random variable x may be according to a probability function f(x) but certain values may not be admissible. In that case, we remove the total probability masses on the non-admissible points, then re-weigh the remaining points to create a new probability function. For example, in a binomial case suppose that the event of getting zero success is not admissible. In this case, we remove the point x = 0. At x = 0, the probability is $\binom{n}{0}p^0(1-p)^{n-0}=(1-p)^n$. Therefore, the remaining mass is $c_0=1-(1-p)^n$. Hence if we divide the remaining probabilities by $c_0$ then the remaining points can produce a truncated binomial probability law, which is

$$g(x)=\begin{cases}\dfrac{1}{c_0}\dbinom{n}{x}p^x(1-p)^{n-x}, & x=1,2,\dots,n,\ 0<p<1\\[2mm] 0, & \text{elsewhere.}\end{cases}$$

Here, g(x) is called the truncated binomial probability function, truncated below x = 1 or at x = 0. Thus truncation is achieved by multiplying the probability function by an appropriate constant c. In the above case, it is $c=\frac{1}{c_0}=\frac{1}{1-(1-p)^n}$.
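The truncation step described above can be sketched in Python as follows. The function names and the sample values n = 4, p = 0.3 are assumptions made for this illustration only.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Ordinary binomial probability function."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def truncated_binomial_pmf(x, n, p):
    """Binomial law truncated below x = 1: the mass at x = 0 is removed
    and the remaining masses are divided by c0 = 1 - (1 - p)^n."""
    if x < 1 or x > n:
        return 0.0
    c0 = 1 - (1 - p)**n
    return binomial_pmf(x, n, p) / c0

# Illustrative parameter values (assumed).
n, p = 4, 0.3
total = sum(truncated_binomial_pmf(x, n, p) for x in range(1, n + 1))
print(total)  # should be 1 up to rounding error
```

The same pattern applies to any other truncation: sum the masses that remain and divide by that sum.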




5.10. Compute the truncation constant c so that cf (x) is a truncated probability func- 

tion of f (x) in the following cases: 

(i) Binomial probability function, truncated below x = 1 (Here, $c=1/c_0$ where $c_0$ is
given above);

(ii) Binomial probability, truncated at x =n; 

(iii) Poisson probability function, truncated below x = 1; 

(iv) Poisson probability function, truncated below x = 2; 

(v) Geometric probability function, truncated below x = 2; 

(vi) Geometric probability function, truncated above x = 10. 


Probability Generating Function. Consider a discrete random variable taking non-zero probabilities at the points x = 0,1,... and let f(x) be the probability function. Consider the expected value of $t^x$ for some parameter t. Let us denote it by P(t). Then we have

$$P(t)=E(t^x)=\sum_{x=0}^{\infty}t^xf(x) \tag{5.25}$$

where, for example, the probability that x takes the value 5 is Pr{x = 5} = f(5) or it is the coefficient of $t^5$ on the right side of (5.25). Thus the various probabilities, such as Pr{x = 0}, Pr{x = 1}, ... are generated by P(t) or they are available from the right side series in (5.25), provided the right side series is convergent. In the case when x = 0,1,2,... with non-zero probabilities then P(t) in (5.25) is called the generating function for the probability function f(x) of this random variable x. We can also notice further properties of this generating function. Suppose that the series on the right side in (5.25) or P(t) is differentiable, then differentiate with respect to t and evaluate at t = 1, then we get E(x). For example,

$$\frac{d}{dt}P(t)\Big|_{t=1}=\frac{d}{dt}\sum_{x=0}^{\infty}t^xf(x)\Big|_{t=1}=\sum_{x=0}^{\infty}xt^{x-1}f(x)\Big|_{t=1}=\sum_{x=0}^{\infty}xf(x)=E(x).$$

Successive derivatives evaluated at t = 1 will produce E(x), E[x(x-1)], E[x(x-1)(x-2)] and so on, when the P(t) series is uniformly convergent and differentiable term by term.
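To see the generating property at work, here is a small Python sketch that stores P(t) as a list of coefficients and differentiates it term by term. A binomial law with n = 6, p = 0.25 is used purely as an assumed example, since its P(t) is a finite polynomial; the helper names are hypothetical.

```python
from math import comb

def pgf_coefficients(n, p):
    """Coefficients of P(t) for a binomial law: entry x is Pr{x}."""
    return [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]

def derivative(coeffs):
    """Coefficients of d/dt of sum_x coeffs[x] * t^x (term by term)."""
    return [x * c for x, c in enumerate(coeffs)][1:]

def evaluate(coeffs, t):
    """Value of the polynomial sum_x coeffs[x] * t^x at t."""
    return sum(c * t**x for x, c in enumerate(coeffs))

# Assumed example: binomial with n = 6, p = 0.25.
n, p = 6, 0.25
P = pgf_coefficients(n, p)

first = evaluate(derivative(P), 1.0)               # E(x)
second = evaluate(derivative(derivative(P)), 1.0)  # E[x(x-1)]
second_moment = second + first                     # E(x^2) = E[x(x-1)] + E(x)
print(P[5], first, second, second_moment)
```

Here `P[5]` is the coefficient of $t^5$, that is, Pr{x = 5}, and the derivatives at t = 1 give the factorial moments.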


5.11. Compute (a) the probability generating function P(t), (b) $E(x)$ by using P(t),
(c) $E(x^2)$ by using P(t) for the following cases: (i) Geometric probability law; (ii) Negative binomial probability law.


5.12. A gambler is betting on a dice game. Two dice will be rolled once. The gambler
puts in Rs 5 (his bet is Rs 5). If the same numbers turn up on the two dice, then the
gambler wins double his bet, that is, Rs 10, otherwise he loses his bet (Rs 5). Assuming
that the dice are balanced:
(i) What is the gambling house's expected return per game from this gambler?
(ii) What is the probability of the gambler winning exactly five out of 10 such games?
(iii) What is the gambler's expected return in 10 such games?


5.13. Cars are arriving at a service station at the rate of 0.1 per minute, time being 
measured in minutes. Assuming a Poisson arrival of cars to this service station, what 
is the probability that 
(a) in a randomly selected twenty minute interval there are
(i) exactly 3 arrivals;
(ii) at least 2 arrivals;
(iii) no arrivals; 
(b) if 5 such 20-minute intervals are selected at random then what is the probability 
that in at least one of these intervals 
(i) (a)(i) happens; 
(ii) (a)(ii) happens; 
(iii) (a)(iii) happens. 


5.14. The number of floods in a local river during rainy season is known to follow a 
Poisson distribution with the expected number of floods 3. What is the probability that 
(a) during one rainy season 

(i) there are exactly 5 floods; 

(ii) there is no flood; 

(iii) at least one flood; 
(b) if 3 rainy seasons are selected at random, then none of the seasons has 

(i) (a)(i) happening; 

(ii) (a)(ii) happening; 

(iii) (a)(iii) happening; 
(c) (i) (a)(i) happens for the first time at the 3rd season;

(ii) (a)(iii) happens for the second time at the 3rd season. 


5.15. From a well-shuffled deck of 52 playing cards (13 spades, 13 clubs, 13 hearts, 
13 diamonds) a hand of 8 cards is selected at random. What is the probability that the 
hand contains (i) 5 spades? (ii) no spades? (iii) 5 spades and 3 hearts? (iv) 3 spades 
2 clubs, 2 hearts, 1 diamond? 


6 Commonly used density functions 


6.1 Introduction 


Here, we will deal with the continuous case. Some most commonly used density func- 
tions will be discussed here and at the end a few more densities will be listed. The very 
basic density function is the uniform or rectangular density as shown in Figure 6.1. 
This was already introduced in Example 4.4 in Chapter 4 and the mean value and variance were evaluated there. For the sake of completeness, we list them here again.


6.2 Rectangular or uniform density 


$$f_1(x)=\begin{cases}\dfrac{1}{b-a}, & a\le x\le b\\[2mm] 0, & \text{otherwise.}\end{cases}$$


The graph looks like a rectangle, and hence it is also called a rectangular density. The 
total probability mass 1 is uniformly distributed over the interval [a,b], b > a, and 
hence it is called a uniform density. 


Figure 6.1: Uniform or rectangular density.


The probability that the random variable x falls in the interval a < c < x < d < b is
marked in the graph. It is the area under the curve between the ordinates at x = c and
x = d. It was shown in Example 4.4 that

$$E(x)=\frac{b+a}{2}\quad\text{and}\quad \operatorname{Var}(x)=\frac{(b-a)^2}{12}. \tag{6.1}$$

The moment generating function in this case is the following:

$$M(t)=E[e^{tx}]=\int_a^b\frac{e^{tx}}{b-a}\,dx=\frac{1}{b-a}\Big[\frac{e^{tx}}{t}\Big]_a^b=\frac{e^{bt}-e^{at}}{t(b-a)},\quad t\neq 0. \tag{6.2}$$

Open Access. © 2018 Arak M. Mathai, Hans J. Haubold, published by De Gruyter. This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
https://doi.org/10.1515/9783110562545-006

One can also obtain the moments by differentiating as well as by expanding M(t) here.
For example, let us evaluate the first two moments by expanding M(t):

$$\begin{aligned}
M(t)&=\frac{1}{t(b-a)}\Big\{\Big[1+\frac{bt}{1!}+\frac{b^2t^2}{2!}+\frac{b^3t^3}{3!}+\cdots\Big]-\Big[1+\frac{at}{1!}+\frac{a^2t^2}{2!}+\frac{a^3t^3}{3!}+\cdots\Big]\Big\}\\
&=1+\frac{(b^2-a^2)}{(b-a)}\frac{t}{2!}+\frac{(b^3-a^3)}{(b-a)}\frac{t^2}{3!}+\cdots\\
&=1+\frac{(b+a)}{2}\frac{t}{1!}+\frac{(b^2+ab+a^2)}{3}\frac{t^2}{2!}+\cdots
\end{aligned}$$

Since

$$\frac{b^2-a^2}{b-a}=b+a\quad\text{and}\quad\frac{b^3-a^3}{b-a}=b^2+ab+a^2$$

we have the coefficient of $\frac{t}{1!}$ as $\frac{(b+a)}{2}$ and the coefficient of $\frac{t^2}{2!}$ as $\frac{(a^2+ab+b^2)}{3}$. Hence

$$E(x)=\frac{(a+b)}{2}\quad\text{and}\quad E(x^2)=\frac{(a^2+ab+b^2)}{3} \tag{6.3}$$

in the uniform distribution.
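The moments in (6.3) and the closed form of M(t) can be checked numerically. The sketch below uses a simple midpoint rule; the interval [a, b] = [1, 4], the value t = 0.3 and the helper `integrate` are assumptions made only for this illustration.

```python
from math import exp

def integrate(g, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(g(lo + (i + 0.5) * h) for i in range(steps))

# Assumed interval for the uniform density.
a, b = 1.0, 4.0
density = 1.0 / (b - a)

mean = integrate(lambda x: x * density, a, b)
second = integrate(lambda x: x * x * density, a, b)

def mgf(t):
    """Closed form of the moment generating function, valid for t != 0."""
    return (exp(b * t) - exp(a * t)) / (t * (b - a))

t = 0.3
mgf_numeric = integrate(lambda x: exp(t * x) * density, a, b)

print(mean, (a + b) / 2)
print(second, (a * a + a * b + b * b) / 3)
print(mgf_numeric, mgf(t))
```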


Example 6.1. The example of a random cut, or a point taken at random on a line segment, was considered in Chapters 1 and 2. Now consider a problem involving areas. A girl is throwing darts at a circular board of diameter 2 meters (2 m). The aim is to hit the center of the board. Assume that she has no experience and she may hit anywhere on the board. What is the probability that she will hit a specified $\frac{1}{2}\times\frac{1}{2}$ square meters region on the board?


Solution 6.1. Let dA be an infinitesimal area on the board around the point of hit. Due to her lack of experience, we may assume that the point of hit is uniformly distributed over the area of the board. The area of the board is $\pi r^2=\pi 1^2=\pi$ square meters, or $\pi$ m². Then we have the density

$$f(A)\,dA=\begin{cases}\dfrac{dA}{\pi}, & 0\le A\le\pi\\[2mm] 0, & \text{elsewhere.}\end{cases}$$

The integral over the specified square will give the area of the square, which is $\frac{1}{2}\times\frac{1}{2}=\frac{1}{4}$ m². Hence the required probability is

$$\frac{1}{4}\cdot\frac{1}{\pi}=\frac{1}{4\pi}.$$
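A Monte Carlo sketch of this dart problem is given below. It is only an illustration under stated assumptions: the board is taken as the unit disc centered at the origin, and the specified square is placed, arbitrarily, at [0, 0.5] × [0, 0.5]; the function name, seed and number of trials are hypothetical choices.

```python
import random
from math import pi

def estimate_hit_probability(trials=200_000, seed=1):
    """Estimate the probability that a uniformly random point on the
    unit disc falls in the square [0, 0.5] x [0, 0.5]."""
    rng = random.Random(seed)
    hits = 0
    accepted = 0
    while accepted < trials:
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y > 1:
            continue  # the dart missed the board; draw again
        accepted += 1
        if 0 <= x <= 0.5 and 0 <= y <= 0.5:
            hits += 1
    return hits / trials

estimate = estimate_hit_probability()
print(estimate, 1 / (4 * pi))
```

Any other placement of the square inside the board gives the same probability, since only the area of the region matters under the uniform assumption.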


Several densities are connected with a gamma function. Hence we define a gamma 
function next. 


Notation 6.1. Γ(α): gamma function.


Definition 6.1 (A gamma function). A gamma function, denoted by Γ(α), exists for all values of α, positive, negative, rational, irrational, and complex values of α, except for α = 0, -1, -2, .... A gamma function can be defined in many ways. Detailed definitions and properties may be seen from the book [2]. But when defining densities, we will need only an integral representation for a gamma function. Such an integral representation is given below:

$$\Gamma(\alpha)=\int_0^{\infty}x^{\alpha-1}e^{-x}\,dx,\quad \Re(\alpha)>0. \tag{6.4}$$

This integral exists only when the real part of α is greater than zero. In statistical problems, usually α is real and then the condition will become α > 0. Other integral representations are available for a gamma function, each with its own conditions. We will make use of only (6.4) in this book. Two basic properties of the gamma function that we will make use of are the following:

$$\Gamma(\alpha)=(\alpha-1)\Gamma(\alpha-1) \tag{6.5}$$

when Γ(α - 1) is defined. Continuing the process, we have

$$\Gamma(\alpha)=(\alpha-1)(\alpha-2)\cdots(\alpha-r)\Gamma(\alpha-r) \tag{6.6}$$

when Γ(α - r) is defined. This property can be seen from the integral representation in (6.4) by integrating by parts, by taking $dv=e^{-x}dx$ and $u=x^{\alpha-1}$ and then using the formula $\int u\,dv=uv-\int v\,du$ [Verification is left to the student]. From (6.6), it follows that if α is a positive integer, say, n = 1,2,3,... then

$$\Gamma(n)=(n-1)!,\quad n=1,2,\dots. \tag{6.7}$$

The second property that we will use is that

$$\Gamma\Big(\frac{1}{2}\Big)=\sqrt{\pi}. \tag{6.8}$$

This will be proved only after considering joint distributions of more than one random variable in the coming chapters. Hence the student may take the result for granted for the time being. Note that the above results (6.5), (6.6) are valid for all α ≠ 0, -1, -2, ...; α need not be a positive number, and $\Re(\alpha)$, when α is complex, need not be positive. For example,


8 
ie 




(- 


GXGGYG)- 7) 


$$\Gamma(3.8)=(2.8)(1.8)(0.8)\Gamma(0.8).$$
$$\Gamma(1.3)=(0.3)\Gamma(0.3).$$
$$\Gamma(1.3)=(0.3)(-0.7)(-1.7)\Gamma(-1.7)=(0.357)\Gamma(-1.7).$$


By using the above procedures, one can reduce any gamma function Γ(β), with β real,
to a Γ(α) where 0 < α < 1. Γ(α) is extensively tabulated for 0 < α < 1, and such numerical tables are available. A density associated with a gamma function of (6.4) is called
a gamma density. A two-parameter gamma density is defined next.
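The reduction of Γ(β) to a Γ(α) with 0 < α < 1 can be sketched with Python's standard `math.gamma`. The helper `reduce_gamma` is a hypothetical name introduced here, and it assumes β is real and not an integer.

```python
from math import gamma, isclose, pi, sqrt

def reduce_gamma(beta):
    """Return (factor, alpha) with Gamma(beta) = factor * Gamma(alpha)
    and 0 < alpha < 1, for real non-integer beta, by repeated use of
    Gamma(beta) = (beta - 1) * Gamma(beta - 1)."""
    factor = 1.0
    while beta > 1:          # step down: Gamma(beta) = (beta-1) Gamma(beta-1)
        beta -= 1
        factor *= beta
    while beta < 0:          # step up: Gamma(beta) = Gamma(beta+1) / beta
        factor /= beta
        beta += 1
    return factor, beta

factor, alpha = reduce_gamma(3.8)
print(factor, alpha)                                  # about 4.032 and 0.8
print(isclose(gamma(3.8), factor * gamma(alpha)))
print(isclose(gamma(0.5), sqrt(pi)))                  # property (6.8)
print(isclose(gamma(1.3), 0.357 * gamma(-1.7)))       # last example above
```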


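6.3 A two-parameter gamma density


$$f_2(x)=\begin{cases}\dfrac{1}{\beta^{\alpha}\Gamma(\alpha)}x^{\alpha-1}e^{-\frac{x}{\beta}}, & 0\le x<\infty,\ \beta>0,\ \alpha>0\\[2mm] 0, & \text{elsewhere.}\end{cases}$$

Since the total probability is 1, we have

$$1=\int_0^{\infty}f_2(x)\,dx\ \Rightarrow\ 1=\frac{1}{\beta^{\alpha}\Gamma(\alpha)}\int_0^{\infty}x^{\alpha-1}e^{-\frac{x}{\beta}}\,dx\ \Rightarrow\ \beta^{\alpha}=\frac{1}{\Gamma(\alpha)}\int_0^{\infty}x^{\alpha-1}e^{-\frac{x}{\beta}}\,dx. \tag{6.9}$$

The representation (6.9) is very useful: the only conditions needed are β > 0
and α > 0, and then we can replace $\beta^{\alpha}$ by a gamma integral.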

From Figure 6.2, note that the parameter β has a scaling effect, and hence β is called the scale parameter and α is called the shape parameter in the gamma case because α throws light on the shape of the density curve. Gamma density is one of the main densities in probability and statistics. Many other densities are associated with it, some of which will be listed later. An extended form of the two-parameter gamma density is available by replacing x by |x| so that the mirror image of the graph is there on the left of the y-axis also. An extended form of the gamma density function is then given by

$$f_2^{*}(x)=\frac{1}{2\beta^{\alpha}\Gamma(\alpha)}|x|^{\alpha-1}e^{-\frac{|x|}{\beta}},\quad -\infty<x<\infty,\ \beta>0,\ \alpha>0.$$

One graph for one set of parameters α and β is given in Figure 6.3.


Figure 6.2: Gamma density: (a) varying β, fixed α; (b) varying α, fixed β.


Figure 6.3: An extended form of the gamma density.


Let us evaluate an arbitrary (s - 1)-th moment and the moment generating function of the gamma density for x > 0. Observe that the (s - 1)-th moment is also the Mellin transform of the density function:

$$E[x^{s-1}]=\frac{1}{\beta^{\alpha}\Gamma(\alpha)}\int_0^{\infty}x^{(\alpha+s-1)-1}e^{-\frac{x}{\beta}}\,dx.$$

Substitute

$$y=\frac{x}{\beta}\ \Rightarrow\ dx=\beta\,dy.$$

Then

$$E[x^{s-1}]=\frac{\beta^{\alpha+s-1}}{\beta^{\alpha}\Gamma(\alpha)}\int_0^{\infty}y^{(\alpha+s-1)-1}e^{-y}\,dy=\frac{\beta^{s-1}}{\Gamma(\alpha)}\int_0^{\infty}y^{(\alpha+s-1)-1}e^{-y}\,dy.$$

But this integral is a gamma function and it is

$$E[x^{s-1}]=\beta^{s-1}\frac{\Gamma(\alpha+s-1)}{\Gamma(\alpha)},\quad \Re(\alpha+s-1)>0. \tag{6.10}$$

From this general moment, we can obtain all integer moments also. Put s = 2 to obtain $E(x)$ and s = 3 to get $E(x^2)$:

$$E(x)=\beta^{s-1}\frac{\Gamma(\alpha+s-1)}{\Gamma(\alpha)}\Big|_{s=2}=\beta\frac{\Gamma(\alpha+1)}{\Gamma(\alpha)}=\beta\alpha\frac{\Gamma(\alpha)}{\Gamma(\alpha)}=\alpha\beta \tag{6.11}$$

by using the formula (6.5).

$$E(x^2)=\beta^2\frac{\Gamma(\alpha+2)}{\Gamma(\alpha)}=\beta^2(\alpha+1)(\alpha)\frac{\Gamma(\alpha)}{\Gamma(\alpha)}=\alpha(\alpha+1)\beta^2. \tag{6.12}$$

The reduction in the gamma is done by using (6.6). Then the variance for a gamma random variable is given by

$$\operatorname{Var}(x)=E[x^2]-[E(x)]^2=\alpha(\alpha+1)\beta^2-(\alpha\beta)^2=\alpha\beta^2. \tag{6.13}$$

Note that the mean value is αβ and the variance is αβ², and hence if β = 1 then the mean value is equal to the variance in this case also, just like the Poisson case as seen from Chapter 5.
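The moment formula (6.10) can be compared against direct numerical integration of the gamma density. In the sketch below the parameter values α = 2.5, β = 1.5, the upper integration limit and the function names are all assumptions made for illustration.

```python
from math import exp, gamma

def moment_formula(s, alpha, beta):
    """E[x^(s-1)] from (6.10): beta^(s-1) Gamma(alpha+s-1) / Gamma(alpha)."""
    return beta**(s - 1) * gamma(alpha + s - 1) / gamma(alpha)

def moment_numeric(s, alpha, beta, upper=100.0, steps=200_000):
    """Midpoint-rule value of the integral of x^(s-1) times the
    two-parameter gamma density over [0, upper]."""
    norm = 1.0 / (beta**alpha * gamma(alpha))
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += x**(s - 1) * norm * x**(alpha - 1) * exp(-x / beta)
    return h * total

# Assumed parameter values.
alpha, beta = 2.5, 1.5
mean = moment_formula(2, alpha, beta)      # alpha * beta
second = moment_formula(3, alpha, beta)    # alpha * (alpha + 1) * beta^2
variance = second - mean**2                # alpha * beta^2

print(mean, moment_numeric(2, alpha, beta))
print(second, moment_numeric(3, alpha, beta))
print(variance)
```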

The moment generating function M(t) in the gamma case is the following:

$$M(t)=E[e^{tx}]=\int_0^{\infty}e^{tx}f_2(x)\,dx=\frac{1}{\beta^{\alpha}\Gamma(\alpha)}\int_0^{\infty}x^{\alpha-1}e^{-[\frac{1}{\beta}-t]x}\,dx.$$

Put the exponent as $-y=-[\frac{1}{\beta}-t]x$, which gives $dy=[\frac{1}{\beta}-t]\,dx$, and integrate to obtain

$$M(t)=\frac{[\frac{1}{\beta}-t]^{-\alpha}}{\beta^{\alpha}}=(1-\beta t)^{-\alpha}\quad\text{for }(1-\beta t)>0. \tag{6.14}$$

This condition is needed for the integral to be convergent, otherwise the exponent in the integral can become positive and the integral will give +∞, that is, the integral diverges. When a random variable x is gamma distributed with the two-parameter gamma density $f_2(x)$ above then we write x ~ gamma(α, β), where "~" stands for "is distributed as".


What is the probability that the gamma random variable x in $f_2(x)$ lies over the interval [a,b], a > 0, b > a? This is given by the area under the gamma curve between the ordinates at x = a and x = b. This can be obtained by integrating $f_2(x)$ from a to b. This is shown in Figure 6.4.

The integral from a to b is the same as the integral from 0 to b minus the integral from 0 to a. Let us see what such an integral is. For example, what is the integral from 0 to a:

$$\begin{aligned}\int_0^{a}f_2(x)\,dx&=\frac{1}{\beta^{\alpha}\Gamma(\alpha)}\int_0^{a}x^{\alpha-1}e^{-\frac{x}{\beta}}\,dx\\&=\frac{1}{\Gamma(\alpha)}\int_0^{a/\beta}y^{\alpha-1}e^{-y}\,dy,\quad y=\frac{x}{\beta}\\&=\frac{1}{\Gamma(\alpha)}\gamma\Big(\alpha,\frac{a}{\beta}\Big)\end{aligned} \tag{6.15}$$

where γ(·,·) is the incomplete gamma function, which is tabulated, and numerical tables are available. Thus, for evaluating probabilities one has to use incomplete gamma tables if α ≠ 1. When α = 1, then it is an exponential density which can be integrated and the probabilities can be evaluated directly. If α is a positive integer, then also one can integrate by parts and obtain explicit forms.


Figure 6.4: Probability in the gamma case.


Some special cases of the gamma density are the following. 


6.3.1 Exponential density 


One parameter exponential density was dealt with in Chapters 3 and 4. This is available from the gamma density $f_2(x)$ by putting α = 1. That is,

$$f_3(x)=\begin{cases}\dfrac{1}{\beta}e^{-\frac{x}{\beta}}, & 0\le x<\infty,\ \beta>0\\[2mm] 0, & \text{elsewhere.}\end{cases}$$

The moments and moment generating function are available from the corresponding
quantities for the gamma case by putting α = 1. Exponential density is widely used as
a model to describe waiting time, such as waiting in a queue, waiting for a scheduled 
bus etc. But it may not be a good model for all types of waiting times. If the waiting 
time consists of several components of waiting times such as waiting for completing 
a medical examination at a doctor's office, which may consist of blood test, physi- 
cal examination, checking weight and height, X-ray, etc., then the individual compo- 


nents may be exponentially distributed but the total waiting time is usually gamma 
distributed. 


6.3.2 Chi-square density 


In a gamma density when β = 2 and α = n/2, n = 1,2,..., we obtain a chi-square density with n degrees of freedom. The density is the following:

$$f_4(x)=\begin{cases}\dfrac{1}{2^{\frac{n}{2}}\Gamma(\frac{n}{2})}x^{\frac{n}{2}-1}e^{-\frac{x}{2}}, & 0\le x<\infty,\ n=1,2,\dots\\[2mm] 0, & \text{elsewhere.}\end{cases}$$

The meaning of "degrees of freedom" will be given when we consider sampling distributions later on. For the time being, the student may take it as a parameter n in $f_4(x)$, taking integer values 1,2,.... Chi-square density is widely used in statistical decision making such as testing of statistical hypotheses, model building, designing of experiments, regression analysis, etc. In fact, this is one of the main distributions in statistical inference. A chi-square random variable with k degrees of freedom is denoted by $\chi_k^2$.


Notation 6.2. $\chi_k^2$: chi-square with k degrees of freedom.


Definition 6.2 (Chi-square random variable). A chi-square random variable with k degrees of freedom is a gamma random variable with the parameters α = k/2 and β = 2, k = 1,2,....


Example 6.2. The waiting time for the first pregnancy among women in a certain 
community from the time of marriage or cohabitation is found to be gamma distributed 
with scale parameter f = 1 and the expected waiting time 3 months, time being mea- 
sured in months. (i) If a freshly married woman from this community is selected at 
random, what is the probability that she has to wait at least eight months before she 
gets pregnant? (ii) If three freshly married women are selected at random from this 
community, then what is the probability that at least two of them have to wait at least 
eight months to get pregnant? 


Solution 6.2. The expected value in the gamma case is found to be αβ, and if β = 1 then α is given to be 3. Waiting for at least eight months means that the waiting time t ≥ 8. Hence we need p = Pr{t ≥ 8}. That is,

$$p=\Pr\{t\ge 8\}=\int_8^{\infty}\frac{1}{\Gamma(3)}x^{3-1}e^{-x}\,dx=\frac{1}{2}\int_8^{\infty}x^2e^{-x}\,dx$$

since $\beta^{\alpha}=1$, Γ(3) = 2! = 2. Since α is a positive integer, we can integrate by parts by taking $dv=e^{-x}dx$ and u as $x^2$. That is,

$$\begin{aligned}p&=\frac{1}{2}\int_8^{\infty}x^2e^{-x}\,dx\\&=\frac{1}{2}\Big\{-[x^2e^{-x}]_8^{\infty}+2\int_8^{\infty}xe^{-x}\,dx\Big\}\\&=\frac{1}{2}\Big\{64e^{-8}-2[xe^{-x}]_8^{\infty}-2[e^{-x}]_8^{\infty}\Big\}\\&=\frac{1}{2}(82)e^{-8}=41e^{-8}.\end{aligned}$$

This answers (i). For answering (ii), we consider three Bernoulli trials with probability of success p above and the required probability is the probability that the number of successes is 2 or 3. That is,

$$\sum_{x=2}^{3}\binom{3}{x}p^x(1-p)^{3-x}=\binom{3}{2}p^2(1-p)+\binom{3}{3}p^3=3p^2(1-p)+p^3=p^2(3-2p).$$
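Numerical values for both parts of Example 6.2 can be obtained with the sketch below. The helper `gamma_tail_integer_shape` is a hypothetical name; it uses the finite sum that results from repeated integration by parts when the shape parameter is a positive integer and β = 1.

```python
from math import comb, exp, factorial

def gamma_tail_integer_shape(t, shape):
    """Pr{x >= t} for a gamma density with beta = 1 and a positive
    integer shape parameter: exp(-t) * sum_{k < shape} t^k / k!."""
    return exp(-t) * sum(t**k / factorial(k) for k in range(shape))

# Part (i): alpha = 3, beta = 1, waiting time at least 8 months.
p = gamma_tail_integer_shape(8, 3)

# Part (ii): at least two "successes" in three Bernoulli trials.
at_least_two = sum(comb(3, x) * p**x * (1 - p)**(3 - x) for x in (2, 3))

print(p, 41 * exp(-8))
print(at_least_two, p**2 * (3 - 2 * p))
```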


Another density, having connection to a gamma function is the generalized gamma 
density. 


6.4 Generalized gamma density 


$$f_5(x)=\begin{cases}c\,x^{\alpha-1}e^{-bx^{\delta}}, & b>0,\ \alpha>0,\ \delta>0,\ x\ge 0\\[1mm] 0, & \text{elsewhere}\end{cases}$$

where c is the normalizing constant, which can be evaluated by using a gamma integral. Make the substitution

$$y=bx^{\delta}\ \Rightarrow\ x=\Big(\frac{y}{b}\Big)^{\frac{1}{\delta}}\ \Rightarrow\ dx=\frac{1}{\delta}\frac{1}{b^{1/\delta}}\,y^{\frac{1}{\delta}-1}\,dy.$$

The above transformations are valid since x > 0. The limits will remain the same.

$$\int_0^{\infty}x^{\alpha-1}e^{-bx^{\delta}}\,dx=\frac{1}{\delta\,b^{\alpha/\delta}}\int_0^{\infty}y^{\frac{\alpha}{\delta}-1}e^{-y}\,dy=\delta^{-1}b^{-\frac{\alpha}{\delta}}\,\Gamma\Big(\frac{\alpha}{\delta}\Big). \tag{6.16}$$

The conditions α > 0, δ > 0 are already satisfied. Since the total integral is 1 (total probability) the normalizing constant is

$$c=\frac{\delta\,b^{\frac{\alpha}{\delta}}}{\Gamma(\frac{\alpha}{\delta})}. \tag{6.17}$$

This is one generalized form of the gamma density. Another form, the extended gamma density, was introduced in Section 6.3. The form above in Section 6.4 is usually known in the literature as the generalized gamma density. When δ = 1, we have the two-parameter gamma family. When α = δ, we have a popular density known as the Weibull density, which is also available by using a power transformation in an exponential density. When α = δ, the normalizing constant in (6.17) becomes c = δb. When α = 3, δ = 2, we have one form of Maxwell-Boltzmann density in physics. When δ = 1, α an integer we have Erlang density. When δ = 2, α = 2, we have the Rayleigh density. Many more such densities are particular cases of a generalized gamma density.
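The normalizing constant in (6.17) can be checked by numerical integration. The sketch below uses the assumed values α = 3, δ = 2, b = 1.5 (a Maxwell-Boltzmann type case) and a midpoint rule; the function names and the integration limit are illustrative choices.

```python
from math import exp, gamma, isclose

def normalizing_constant(alpha, delta, b):
    """c from (6.17): delta * b^(alpha/delta) / Gamma(alpha/delta)."""
    return delta * b**(alpha / delta) / gamma(alpha / delta)

def total_mass(alpha, delta, b, upper=12.0, steps=120_000):
    """Midpoint-rule integral of c x^(alpha-1) exp(-b x^delta) on [0, upper]."""
    c = normalizing_constant(alpha, delta, b)
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += c * x**(alpha - 1) * exp(-b * x**delta)
    return h * total

# Assumed parameter values.
mass = total_mass(alpha=3.0, delta=2.0, b=1.5)
print(mass)  # should be close to 1

# Weibull case alpha = delta: the constant reduces to delta * b.
print(isclose(normalizing_constant(2.0, 2.0, 1.5), 2.0 * 1.5))
```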




6.4.1 Weibull density 


This is the special case of the generalized gamma density where δ = α.

$$f_6(x)=\begin{cases}\delta b\,x^{\delta-1}e^{-bx^{\delta}}, & b>0,\ \delta>0,\ 0\le x<\infty\\[1mm] 0, & \text{elsewhere.}\end{cases}$$


This density is widely used as a model in many practical problems. Over a thousand 
research papers are available on the application of this model. 
Another function associated with a gamma function is the beta function. 


Notation 6.3. B(α, β): Beta function.


Definition 6.3. A beta function B(α, β) is defined as

$$B(\alpha,\beta)=\frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)},\quad \Re(\alpha)>0,\ \Re(\beta)>0. \tag{6.18}$$

As in the case of the gamma function, the beta function can also be given integral representations.

$$\begin{aligned}B(\alpha,\beta)&=\int_0^1x^{\alpha-1}(1-x)^{\beta-1}\,dx,\quad \Re(\alpha)>0,\ \Re(\beta)>0\\&=\int_0^1y^{\beta-1}(1-y)^{\alpha-1}\,dy,\quad \Re(\alpha)>0,\ \Re(\beta)>0\\&=B(\beta,\alpha).\end{aligned} \tag{6.19}$$

The second integral is available from the first by putting y = 1 - x. These two integrals are called type-1 beta integrals.

$$\begin{aligned}B(\alpha,\beta)&=\int_0^{\infty}z^{\alpha-1}(1+z)^{-(\alpha+\beta)}\,dz,\quad \Re(\alpha)>0,\ \Re(\beta)>0\\&=\int_0^{\infty}u^{\beta-1}(1+u)^{-(\alpha+\beta)}\,du,\quad \Re(\alpha)>0,\ \Re(\beta)>0\\&=B(\beta,\alpha).\end{aligned} \tag{6.20}$$

These are called type-2 beta integrals.

Integrals in (6.20) can also be obtained from (6.19) by using the substitution $z=\frac{x}{1-x}$, and the last integral from (6.20) by the substitution $u=\frac{1}{z}$. Transformation of variables will be discussed after introducing joint distributions in the next chapter. One variable transformation will be discussed at the end of the present chapter. Connections of beta integrals to gamma integral will be considered after introducing joint distributions. We will introduce beta densities associated with these beta integrals.




6.5 Beta density 


There are two types, type-1 and type-2; the latter is also called "inverted beta density". But we will use the terminology "type-2 beta" instead of inverted beta. The type-1 beta density is associated with the type-1 beta integral.

$$f_7(x)=\begin{cases}\dfrac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1}, & 0\le x\le 1,\ \alpha>0,\ \beta>0\\[2mm] 0, & \text{otherwise.}\end{cases}$$

A family of densities is available for various values of the parameters α and β. When β = 1, it behaves like $x^{\alpha-1}$ and when α = 1 it is of the form $(1-x)^{\beta-1}$, and both of these are power functions. A few of them are shown in Figure 6.5.


Figure 6.5: Type-1 beta densities.


Let us evaluate an arbitrary h-th moment of a type-1 beta random variable.

$$E[x^h]=\int_0^1x^hf_7(x)\,dx=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\int_0^1x^{\alpha+h-1}(1-x)^{\beta-1}\,dx$$

which is available from the type-1 beta integral by replacing α by α + h. That is,

$$E[x^h]=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\frac{\Gamma(\alpha+h)\Gamma(\beta)}{\Gamma(\alpha+\beta+h)}=\frac{\Gamma(\alpha+h)}{\Gamma(\alpha)}\,\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha+\beta+h)},\quad \Re(\alpha+h)>0. \tag{6.21}$$

If h is real, then it is possible to have some negative moments existing such that α + h > 0. If α = 5.8, then h can be down to -5.8 but not equal to -5.8. When h = s - 1, we have the Mellin transform of the density $f_7(x)$. The moment generating function will go into a series unless α and β are positive integers. Hence we will not consider the moment generating function here. [For obtaining the series, expand $e^{tx}$ and integrate term by term. It will go into a hypergeometric series of the ${}_1F_1$ type.] From (6.21), the first two moments are available by putting h = 1 and h = 2 and then simplifying by using the formulae (6.5) and (6.6), and they are the following, by observing that

$$\frac{\Gamma(\alpha+1)}{\Gamma(\alpha)}=\alpha\frac{\Gamma(\alpha)}{\Gamma(\alpha)}=\alpha\quad\text{and}\quad\frac{\Gamma(\alpha+2)}{\Gamma(\alpha)}=\alpha(\alpha+1)$$

and similarly for other gamma ratios. That is,

$$E[x]=\frac{\Gamma(\alpha+1)}{\Gamma(\alpha)}\,\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha+\beta+1)}=\frac{\alpha}{\alpha+\beta} \tag{6.22}$$

$$E[x^2]=\frac{\Gamma(\alpha+2)}{\Gamma(\alpha)}\,\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha+\beta+2)}=\frac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)}. \tag{6.23}$$

By using (6.22) and (6.23), one can evaluate the variance of a type-1 beta random variable. The type-2 beta density is associated with the type-2 beta integral and it is the following:

$$f_8(x)=\begin{cases}\dfrac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1+x)^{-(\alpha+\beta)}, & \alpha>0,\ \beta>0,\ x\ge 0\\[2mm] 0, & \text{otherwise.}\end{cases}$$

Various shapes are there for various values of α and β; a few are shown in Figure 6.6.


Figure 6.6: Family of type-2 beta densities.


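In a type-2 beta density if we make a transformation $x=\frac{m}{n}F$, m,n = 1,2,..., then we get the F-density or the variance ratio density, which is one of the main densities in statistical applications. Both type-1 and type-2 beta random variables are connected to gamma random variables. Let us evaluate the h-th moment for an arbitrary h in the type-2 beta case.

$$E[x^h]=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\int_0^{\infty}x^{\alpha+h-1}(1+x)^{-(\alpha+\beta)}\,dx$$

which is available from a type-2 beta integral by replacing α by α + h and β by β - h

$$=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\frac{\Gamma(\alpha+h)\Gamma(\beta-h)}{\Gamma(\alpha+\beta)}=\frac{\Gamma(\alpha+h)}{\Gamma(\alpha)}\,\frac{\Gamma(\beta-h)}{\Gamma(\beta)},\quad \Re(\alpha+h)>0,\ \Re(\beta-h)>0. \tag{6.24}$$

Thus, the effective condition on h is that $-\Re(\alpha)<\Re(h)<\Re(\beta)$. If α, β, h are real, then -α < h < β. Thus only a few moments in this interval will exist. Outside that, the moments will not exist. Let us look into the first two moments. Observe that

$$\frac{\Gamma(\alpha+1)}{\Gamma(\alpha)}=\alpha;\qquad \frac{\Gamma(\alpha+2)}{\Gamma(\alpha)}=\alpha(\alpha+1);$$

$$\frac{\Gamma(\beta-1)}{\Gamma(\beta)}=\frac{1}{\beta-1},\ \beta\neq 1;\qquad \frac{\Gamma(\beta-2)}{\Gamma(\beta)}=\frac{1}{(\beta-1)(\beta-2)},\ \beta\neq 1,2.$$

Hence

$$E[x]=\frac{\Gamma(\alpha+1)}{\Gamma(\alpha)}\,\frac{\Gamma(\beta-1)}{\Gamma(\beta)}=\frac{\alpha}{\beta-1},\quad \beta\neq 1. \tag{6.25}$$

$$E[x^2]=\frac{\Gamma(\alpha+2)}{\Gamma(\alpha)}\,\frac{\Gamma(\beta-2)}{\Gamma(\beta)}=\frac{\alpha(\alpha+1)}{(\beta-1)(\beta-2)},\quad \beta\neq 1,2. \tag{6.26}$$

By using (6.25) and (6.26), one can compute the variance of a type-2 beta random variable.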
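The general moment expressions for the two beta densities can be evaluated directly with `math.gamma` and compared with the simplified first and second moments. This is a sketch only: the function names and the parameter values α = 2.5, β = 3.5 are assumptions, and real parameters are assumed throughout.

```python
from math import gamma, isclose

def type1_moment(h, alpha, beta):
    """h-th moment of a type-1 beta variable, valid for alpha + h > 0."""
    if alpha + h <= 0:
        raise ValueError("moment does not exist: need alpha + h > 0")
    return (gamma(alpha + h) * gamma(alpha + beta)
            / (gamma(alpha) * gamma(alpha + beta + h)))

def type2_moment(h, alpha, beta):
    """h-th moment of a type-2 beta variable, valid for -alpha < h < beta."""
    if not (-alpha < h < beta):
        raise ValueError("moment does not exist: need -alpha < h < beta")
    return gamma(alpha + h) * gamma(beta - h) / (gamma(alpha) * gamma(beta))

# Assumed parameter values.
alpha, beta = 2.5, 3.5

print(isclose(type1_moment(1, alpha, beta), alpha / (alpha + beta)))
print(isclose(type1_moment(2, alpha, beta),
              alpha * (alpha + 1) / ((alpha + beta) * (alpha + beta + 1))))
print(isclose(type2_moment(1, alpha, beta), alpha / (beta - 1)))
print(isclose(type2_moment(2, alpha, beta),
              alpha * (alpha + 1) / ((beta - 1) * (beta - 2))))
```

Calling `type2_moment(4, alpha, beta)` with these values raises an error, reflecting that only moments with -α < h < β exist in the type-2 case.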


Example 6.3. The proportion of people who are politically conscious from village to 
village in Tamilnadu is seen to be type-1 beta distributed with the parameters a = 2.5 
and B = 3.5, that is, x ~ type-1 beta(a = 2.5, B = 3.5). If a village is selected at random, 
what is the probability that the proportion of politically conscious people in this village is below 0.2?


Solution 6.3. The required probability, say p, is available from the area under the
curve between the ordinates at x = 0 and x = 0.2 or from the integral

$$p = \frac{\Gamma(6)}{\Gamma(2.5)\Gamma(3.5)} \int_0^{0.2} x^{2.5-1}(1-x)^{3.5-1}\,dx
= \frac{\Gamma(6)}{\Gamma(2.5)\Gamma(3.5)} \int_0^{0.2} x^{1.5}(1-x)^{2.5}\,dx.$$


The gammas can be simplified. 




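$$\Gamma(6) = 5! = 120, \quad \Gamma(2.5) = \tfrac{3}{2}\cdot\tfrac{1}{2}\,\Gamma\!\left(\tfrac{1}{2}\right) = \tfrac{3}{4}\sqrt{\pi}, \quad
\Gamma(3.5) = \tfrac{5}{2}\cdot\tfrac{3}{2}\cdot\tfrac{1}{2}\,\Gamma\!\left(\tfrac{1}{2}\right) = \tfrac{15}{8}\sqrt{\pi},$$

$$\frac{\Gamma(6)}{\Gamma(2.5)\Gamma(3.5)} = \frac{256}{3\pi}.$$

Therefore,

$$p = \frac{256}{3\pi} \int_0^{0.2} x^{1.5}(1-x)^{2.5}\,dx.$$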


But the integral part cannot be explicitly evaluated. If $\beta$ were a positive integer, then
we could have expanded $(1-x)^{\beta-1}$ by using a binomial expansion and integrated term
by term to produce a sum of a finite number of terms. If $\alpha$ were a positive integer, then
one could have transformed x = 1 - y [the limits will change] and expanded $(1-y)^{\alpha-1}$,
which would also have produced a finite sum. Our exponents here are $\alpha - 1 = 1.5$ and
$\beta - 1 = 2.5$, not integers. Then what we can do is either expand $(1-x)^{2.5}$ by using a
binomial expansion and integrate term by term, which will give a convergent infinite
series, or use what are known as the incomplete beta tables. Integrals of the type

$$B_a(\alpha, \beta) = \int_0^a x^{\alpha-1}(1-x)^{\beta-1}\,dx \tag{6.27}$$

are tabulated for various values of a, $\alpha$, $\beta$ in the incomplete beta tables. Alternatively,
one can use a program such as Maple or Mathematica, which will produce the numerical answer.
In (6.27), if we expand $(1-x)^{\beta-1}$ and integrate term by term, then we will get a
hypergeometric series of the ${}_2F_1$ type. We will leave the final computation as an exercise
to the students.
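
The term-by-term integration described above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper name `incomplete_beta_series` and the truncation at 60 terms are choices made here, not part of the text, and the series is used only for 0 < a < 1, where it converges.

```python
from math import gamma, pi

def incomplete_beta_series(a, alpha, beta, terms=60):
    """Approximate B_a(alpha, beta) of (6.27) for 0 < a < 1.

    Expands (1 - x)^(beta - 1) binomially and integrates x^(alpha - 1 + k)
    term by term from 0 to a. The number of terms is an assumed cutoff.
    """
    total = 0.0
    coeff = 1.0  # (-1)^k times the generalized binomial coefficient C(beta-1, k), k = 0
    for k in range(terms):
        total += coeff * a ** (alpha + k) / (alpha + k)
        # step to (-1)^(k+1) * C(beta-1, k+1)
        coeff *= -(beta - 1 - k) / (k + 1)
    return total

alpha, beta, a = 2.5, 3.5, 0.2
const = gamma(alpha + beta) / (gamma(alpha) * gamma(beta))  # should equal 256/(3*pi)
p = const * incomplete_beta_series(a, alpha, beta)
```

With the values of Example 6.3 the series converges quickly because a = 0.2 is small, and p comes out at roughly 0.13.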


6.6 Laplace density 


A density which is a good model to describe simple input-output situations, opposing 
forces, creation of sand dunes etc is the Laplace density or double exponential density: 
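$$f(x) = \begin{cases} c\,e^{-\theta|x|}, & -\infty < x < \infty,\ \theta > 0 \\ 0, & \text{elsewhere} \end{cases}$$

where c is the normalizing constant. Let us evaluate c. Note that

$$|x| = \begin{cases} -x & \text{for } x < 0 \\ x & \text{for } x \ge 0. \end{cases}$$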


The total probability or the total integral should be 1. Therefore, 




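$$1 = c\int_{-\infty}^{\infty} e^{-\theta|x|}\,dx = c\int_{-\infty}^{0} e^{-\theta|x|}\,dx + c\int_{0}^{\infty} e^{-\theta|x|}\,dx$$

$$= c\int_{-\infty}^{0} e^{-\theta(-x)}\,dx + c\int_{0}^{\infty} e^{-\theta x}\,dx = c\int_{0}^{\infty} e^{-\theta y}\,dy + c\int_{0}^{\infty} e^{-\theta x}\,dx$$

by changing y = -x in the first integral. Hence

$$1 = 2c\int_0^{\infty} e^{-\theta t}\,dt = 2c\left[-\frac{1}{\theta}e^{-\theta t}\right]_0^{\infty} = \frac{2c}{\theta}.$$

Therefore, $c = \theta/2$. The graph of the density is given in Figure 6.7.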


Figure 6.7: Laplace density.


Laplace density is widely used in non-Gaussian stochastic processes and time series 
models also. 


Example 6.4. For a Laplace density with parameter $\theta = 2$, compute the probability
over the interval [-3, 2].


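Solution 6.4. From the graph in Figure 6.7, note that the Laplace density is a symmetric
density, symmetric about x = 0. With $\theta = 2$ the normalizing constant is $c = \theta/2 = 1$. Hence

$$\Pr\{-3 < x < 2\} = \int_{-3}^{2} e^{-2|x|}\,dx = \int_0^3 e^{-2x}\,dx + \int_0^2 e^{-2x}\,dx
= \tfrac{1}{2}\left[1 - e^{-6}\right] + \tfrac{1}{2}\left[1 - e^{-4}\right] = 1 - \tfrac{1}{2}\left[e^{-6} + e^{-4}\right] \approx 0.9896.$$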
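
As a cross-check of Example 6.4, the interval probability can also be obtained from the distribution function of the Laplace density with $c = \theta/2$. The sketch below is illustrative only; the names `laplace_cdf` and `laplace_prob` are hypothetical helpers introduced here.

```python
from math import exp

def laplace_cdf(x, theta):
    """Distribution function of the density (theta/2) * exp(-theta * |x|)."""
    if x < 0:
        return 0.5 * exp(theta * x)
    return 1.0 - 0.5 * exp(-theta * x)

def laplace_prob(lo, hi, theta):
    """Pr{lo < x < hi} as a difference of distribution-function values."""
    return laplace_cdf(hi, theta) - laplace_cdf(lo, theta)

p = laplace_prob(-3.0, 2.0, theta=2.0)
# Closed form obtained by splitting the integral at x = 0 and using symmetry
closed_form = 1.0 - 0.5 * (exp(-6.0) + exp(-4.0))
```

Both routes give the same number, about 0.9896, so almost all of the probability lies in [-3, 2] when $\theta = 2$.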


6.7 Gaussian density or normal density 


The most widely used density is the Gaussian density, which is also called the normal 
density. [This is another unfortunate technical term. It does not mean that other
densities are abnormal, or that this is some standard density and others are aberrations.]
The density is the following: 


$$f(x) = c\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty$$

for $-\infty < \mu < \infty$, $\sigma > 0$, where $\mu$ (Greek letter mu) and $\sigma$ (Greek letter sigma) are
parameters, and c is the normalizing constant.

Let us compute the normalizing constant c, the mean value and the variance. 
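First, make the substitution

$$u = \frac{x-\mu}{\sigma} \Rightarrow dx = \sigma\,du.$$

The total probability is 1 and, therefore,

$$1 = c\int_{-\infty}^{\infty} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx = c\sigma\int_{-\infty}^{\infty} e^{-\frac{u^2}{2}}\,du = 2c\sigma\int_0^{\infty} e^{-\frac{u^2}{2}}\,du,$$

since the integrand is an even and exponentially decaying function. Hence

$$1 = c\sigma\sqrt{2}\int_0^{\infty} v^{\frac{1}{2}-1}e^{-v}\,dv$$

by putting $v = \frac{u^2}{2} \Rightarrow du = \frac{\sqrt{2}}{2}\,v^{-\frac{1}{2}}\,dv$. Then

$$1 = c\sigma\sqrt{2}\,\Gamma\!\left(\tfrac{1}{2}\right) = c\sigma\sqrt{2\pi} \Rightarrow c = \frac{1}{\sigma\sqrt{2\pi}}.$$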


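For computing the mean value and variance we make the same substitution, $y = \frac{x-\mu}{\sigma} \Rightarrow
dx = \sigma\,dy$ and $x = \mu + \sigma y$. Then the mean value is

$$E(x) = \int_{-\infty}^{\infty} \frac{x}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx
= \mu\int_{-\infty}^{\infty} \frac{e^{-\frac{y^2}{2}}}{\sqrt{2\pi}}\,dy + \sigma\int_{-\infty}^{\infty} \frac{y\,e^{-\frac{y^2}{2}}}{\sqrt{2\pi}}\,dy = \mu + 0 = \mu$$

because the first integral is the total probability in a Gaussian density with $\mu = 0$ and
$\sigma^2 = 1$ and the second is an integral over an odd function, where each piece gives a
convergent integral, and hence zero. Thus the parameter $\mu$ sitting in the density is the
mean value of the Gaussian random variable. Now, let us compute the variance, by using
the substitutions $y = \frac{x-\mu}{\sigma}$ and $u = \frac{y^2}{2}$.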


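$$\operatorname{Var}(x) = E[x - E(x)]^2 = E[x - \mu]^2
= \int_{-\infty}^{\infty} \frac{[x-\mu]^2}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx
= \frac{\sigma^2}{\sqrt{2\pi}}\int_{-\infty}^{\infty} y^2 e^{-\frac{y^2}{2}}\,dy$$

by using the substitution $y = \frac{x-\mu}{\sigma}$. Then

$$\operatorname{Var}(x) = \frac{2\sigma^2}{\sqrt{2\pi}}\int_0^{\infty} y^2 e^{-\frac{y^2}{2}}\,dy$$

by using the property of even functions, and it is

$$= \frac{2\sigma^2}{\sqrt{\pi}}\int_0^{\infty} u^{\frac{3}{2}-1}e^{-u}\,du = \sigma^2$$

by using the substitution $u = \frac{y^2}{2}$ and by observing that the integral is
$\Gamma\!\left(\tfrac{3}{2}\right) = \tfrac{1}{2}\Gamma\!\left(\tfrac{1}{2}\right) = \tfrac{\sqrt{\pi}}{2}$. Hence
the second parameter sitting in the Gaussian density is the variance, and $\sigma^2$ is the
standard notation for the variance.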

For convenience, we use the following standard notations: 

$x \sim N(\mu, \sigma^2)$: x is distributed as a Gaussian or normal distribution with the mean
value $\mu$ and variance $\sigma^2$. For example, $x \sim N(3, 5)$ means normal with mean value 3
and variance 5.

$x \sim N(0, \sigma^2)$: x is normally distributed with mean value zero and variance $\sigma^2$;

$x \sim N(0, 1)$: x is normally distributed with mean value zero and variance unity. This
is also called a standard normal distribution and its density will be of the form:

$$f(x) = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{x^2}{2}}, \quad -\infty < x < \infty. \tag{6.28}$$

The graph of a $N(\mu, \sigma^2)$ will be of the form as depicted in Figure 6.8.


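Figure 6.8: Left: $N(\mu, \sigma^2)$, with the points $\mu - 3\sigma$, $\mu - 2\sigma$, $\mu - \sigma$, $\mu$, $\mu + \sigma$, $\mu + 2\sigma$, $\mu + 3\sigma$ marked; Right: N(0,1).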


In the graphs, we have marked the points $1\sigma$ or one standard deviation away from the
mean value, two standard deviations away from the mean value and three standard
deviations away from the mean value. These intervals are very important in decision
making situations. The $N(\mu, \sigma^2)$ curve is symmetric about $x = \mu$ and there are points
of inflexion at $x = \mu - \sigma$ and $x = \mu + \sigma$. The N(0,1) curve is symmetric about x = 0 and
has points of inflexion at $x = \pm 1$. Definite integrals of the form


$$\int_a^b e^{-\frac{x^2}{2}}\,dx \quad \text{or} \quad \int_0^a e^{-\frac{x^2}{2}}\,dx$$




are not available in closed form because the indefinite integral $\int e^{-t^2}\,dt$ is not available.
But numerical tables of the standard normal density are available. These are called normal tables.
You may find tables of the type:

$$\int_{-\infty}^{a} \frac{1}{\sqrt{2\pi}}\,e^{-\frac{z^2}{2}}\,dz \quad \text{or} \quad \int_0^{b} \frac{1}{\sqrt{2\pi}}\,e^{-\frac{z^2}{2}}\,dz.$$


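Probability on any interval in the standard normal case can be computed by using one
of the above forms of the normal tables. For example,

$$\int_{-3}^{2} \frac{1}{\sqrt{2\pi}}\,e^{-\frac{z^2}{2}}\,dz
= \int_{-3}^{0} \frac{1}{\sqrt{2\pi}}\,e^{-\frac{z^2}{2}}\,dz + \int_{0}^{2} \frac{1}{\sqrt{2\pi}}\,e^{-\frac{z^2}{2}}\,dz
= \int_{0}^{3} \frac{1}{\sqrt{2\pi}}\,e^{-\frac{z^2}{2}}\,dz + \int_{0}^{2} \frac{1}{\sqrt{2\pi}}\,e^{-\frac{z^2}{2}}\,dz$$

due to symmetry and both these integrals can be read from the tables.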

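From the normal probability tables, we can see that approximately 65% of the
probability (more precisely, about 68%) is within one standard deviation of the mean value,
approximately 95% of the area is within two standard deviations of the mean value and
approximately 99% of the area (more precisely, about 99.7%) is within three standard
deviations of the mean value. As probability statements, we have the following:

$$\Pr\{\mu - \sigma < x < \mu + \sigma\} = \Pr\{|x - \mu| < \sigma\} = \Pr\left\{\left|\frac{x-\mu}{\sigma}\right| < 1\right\} \approx 0.65. \tag{6.29}$$

$$\Pr\{\mu - 2\sigma < x < \mu + 2\sigma\} = \Pr\{|x - \mu| < 2\sigma\} = \Pr\left\{\left|\frac{x-\mu}{\sigma}\right| < 2\right\} \approx 0.95. \tag{6.30}$$

$$\Pr\{\mu - 3\sigma < x < \mu + 3\sigma\} = \Pr\{|x - \mu| < 3\sigma\} = \Pr\left\{\left|\frac{x-\mu}{\sigma}\right| < 3\right\} \approx 0.99. \tag{6.31}$$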


These three observations (6.29), (6.30), (6.31) are very important in testing of statistical 
hypotheses and in making “confidence statements” on the parameter μ. 
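
The three probabilities in (6.29) to (6.31) need not be read from tables; they can be computed from the error function, since $\Pr\{|x - \mu| < k\sigma\} = \operatorname{erf}(k/\sqrt{2})$ for a normal variable. A minimal sketch follows; the helper name `prob_within` is an illustrative choice.

```python
from math import erf, sqrt

def prob_within(k):
    """Pr{|x - mu| < k*sigma} for a normal variable, equal to erf(k / sqrt(2))."""
    return erf(k / sqrt(2.0))

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
```

The computed values are about 0.6827, 0.9545 and 0.9973, so the figures 0.65, 0.95 and 0.99 quoted above should be read as rough working approximations.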


Example 6.5. It is found that the monthly incomes of working females in a city are
approximately normally distributed with mean value Rs 10 000 and standard deviation
Rs 2 000. (i) What is the range of incomes, around the mean value, where 95% of
the working females can be found? (ii) If a working female is picked at random from
this city, what is the probability that her income is between Rs 8 000 and Rs 14 000?


Solution 6.5. Approximately 95% of incomes, around the mean value, can be found
in the range $\mu - 2\sigma < x < \mu + 2\sigma$ when x denotes the monthly income. The range is


[10 000 - 2 x 2 000, 10 000 + 2 x 2 000] = [6 000, 14 000].




This answers (i). For (ii), we need the probability Pr{8 000 < x < 14 000}. We will
standardize x.


Standardization of a random variable x means to consider the random variable
$y = \frac{x - E(x)}{\sqrt{\operatorname{Var}(x)}} = \frac{x - \mu}{\sigma}$ so that the new variable y is such that E(y) = 0 and Var(y) = 1.


For (ii), denoting the standardized x as z we have

$$\begin{aligned}
\Pr\{8000 < x < 14000\} &= \Pr\left\{\frac{8000 - 10000}{2000} < \frac{x - \mu}{\sigma} < \frac{14000 - 10000}{2000}\right\} \\
&= \Pr\{-1 < z < 2\} = \Pr\{-1 < z < 0\} + \Pr\{0 < z < 2\} \\
&= \Pr\{0 < z < 1\} + \Pr\{0 < z < 2\} \quad \text{from symmetry} \\
&= 0.325 + 0.475 = 0.8 \text{ approximately.}
\end{aligned}$$


The probabilities are read from standard normal tables. 
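
The standardization step in Solution 6.5 can be mirrored in code, replacing the table lookup by the error function. This is a sketch only; `std_normal_cdf` and `normal_interval_prob` are hypothetical helper names introduced for illustration.

```python
from math import erf, sqrt

def std_normal_cdf(z):
    """Distribution function of N(0, 1), written through the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def normal_interval_prob(lo, hi, mu, sigma):
    """Pr{lo < x < hi} for x ~ N(mu, sigma^2), computed by standardizing the endpoints."""
    z_lo = (lo - mu) / sigma
    z_hi = (hi - mu) / sigma
    return std_normal_cdf(z_hi) - std_normal_cdf(z_lo)

# Values of Example 6.5: mean 10 000, standard deviation 2 000
p = normal_interval_prob(8000, 14000, mu=10000, sigma=2000)
```

The computed probability is close to 0.82, slightly above the figure 0.8 obtained from the rounded values 0.325 and 0.475.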


6.7.1 Moment generating function of the normal density 


Since the Gaussian density is one of the most important densities in statistical and 
probability literature, the moment generating function in the Gaussian case is also 
very important. Again, using the same notation 


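$$M(t) = E[e^{tx}] = \frac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{tx - \frac{(x-\mu)^2}{2\sigma^2}}\,dx.$$

Make the transformation $y = \frac{x-\mu}{\sigma}$; then $dy = \frac{1}{\sigma}\,dx$ and the limits will remain the same,
$-\infty < y < \infty$. Then $x = \mu + \sigma y$. The exponent in the integral above simplifies to the
following:

$$\begin{aligned}
tx - \frac{(x-\mu)^2}{2\sigma^2} &= t(\mu + \sigma y) - \frac{y^2}{2} = t\mu - \frac{1}{2}\left[y^2 - 2\sigma t y\right] \\
&= t\mu - \frac{1}{2}\left[y^2 - 2\sigma t y + \sigma^2 t^2 - \sigma^2 t^2\right] \\
&= t\mu - \frac{1}{2}(y - \sigma t)^2 + \frac{t^2\sigma^2}{2}.
\end{aligned}$$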
Substituting these and replacing $\frac{1}{\sigma}\,dx$ by dy we have

$$M(t) = e^{t\mu + \frac{t^2\sigma^2}{2}}\int_{-\infty}^{\infty} \frac{e^{-\frac{1}{2}(y - \sigma t)^2}}{\sqrt{2\pi}}\,dy.$$

The integral part can be looked upon as the total probability, which is 1, from a normal
density with parameters $t\sigma$ and 1. Therefore,

$$M(t) = e^{t\mu + \frac{t^2\sigma^2}{2}}. \tag{6.32}$$
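That is,

$$x \sim N(\mu, \sigma^2) \Rightarrow M(t) = \exp\left[t\mu + \frac{t^2\sigma^2}{2}\right].$$

$$x \sim N(0, \sigma^2) \Rightarrow M(t) = \exp\left[\frac{t^2\sigma^2}{2}\right].$$

$$x \sim N(0, 1) \Rightarrow M(t) = \exp\left[\frac{t^2}{2}\right].$$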


The characteristic function of the Gaussian random variable or Gaussian or normal
density is obtained by replacing t by it, $i = \sqrt{-1}$. Denoting the characteristic function
by $\phi(t)$, we have

$$\phi(t) = \exp\left[it\mu - \frac{t^2\sigma^2}{2}\right] \tag{6.33}$$

for the normal or Gaussian case. When $x \sim N(0, 1)$, the standard normal, then its
characteristic function is

$$\phi(t) = \exp\left[-\frac{t^2}{2}\right]. \tag{6.34}$$


From the moment generating function (6.32) or from the characteristic function (6.33),
one property is obvious. What is the distribution of a linear function of a normal random
variable? Let $x \sim N(\mu, \sigma^2)$ and let y = ax + b, $a \ne 0$, where a and b are constants.
Then the moment generating function (sometimes abbreviated as mgf) of y, denoted
by $M_y(t)$, is given by

$$\begin{aligned}
M_y(t) = E[e^{ty}] = E[e^{t(ax+b)}] &= e^{tb}E[e^{(ta)x}] \\
&= \exp\left[tb + ta\mu + \frac{t^2a^2\sigma^2}{2}\right] \\
&= \exp\left[t(a\mu + b) + \frac{t^2}{2}(a^2\sigma^2)\right], \quad a \ne 0.
\end{aligned}$$

But this is the mgf of a normal variable with parameters $a\mu + b = E[y] = E[ax + b]$ and
$a^2\sigma^2 = \operatorname{Var}(y) = \operatorname{Var}(ax + b)$. Therefore, every linear function of a normal variable is
again a normal variable.


Result 6.1. If $x \sim N(\mu, \sigma^2)$ then y = ax + b, $a \ne 0$, is distributed as $N(a\mu + b, a^2\sigma^2)$.


Note 6.1. When a = 0, then the mgf is $e^{tb}$, which is the mgf of a degenerate random
variable with the whole probability mass 1 at the point x = b (Observe that b can be
zero also).




For example, 


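$x \sim N(\mu = 2, \sigma^2 = 5) \Rightarrow y = -3x + 4 \sim N(-2, 45).$
$x \sim N(\mu = -4, \sigma^2 = 1) \Rightarrow y = -2x + 7 \sim N(15, 4).$
$x \sim N(\mu = 0, \sigma^2 = 1) \Rightarrow y = 3x + 4 \sim N(4, 9).$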
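
Result 6.1 amounts to a simple update rule for the two parameters, which the following sketch makes explicit. The function name `linear_transform_normal` is hypothetical, and the variance (not the standard deviation) is passed as the second argument to match the $N(\mu, \sigma^2)$ notation.

```python
def linear_transform_normal(mu, var, a, b):
    """Parameters (mean, variance) of y = a*x + b when x ~ N(mu, var), a != 0.

    Follows Result 6.1: y ~ N(a*mu + b, a^2 * var).
    """
    if a == 0:
        # Note 6.1: for a = 0 the variable y is degenerate at b, not normal
        raise ValueError("a must be non-zero")
    return a * mu + b, a * a * var

examples = [
    linear_transform_normal(2, 5, -3, 4),    # (-2, 45)
    linear_transform_normal(-4, 1, -2, 7),   # (15, 4)
    linear_transform_normal(0, 1, 3, 4),     # (4, 9)
]
```

The three results reproduce the three examples listed above.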


Let us see what happens if we take the n-th power of this mgf for a normal random
variable. That is,

$$M(t) = e^{t\mu + \frac{t^2\sigma^2}{2}} \Rightarrow
[M(t)]^n = \left[e^{t\mu + \frac{t^2\sigma^2}{2}}\right]^n = e^{t(n\mu) + \frac{t^2}{2}(n\sigma^2)}.$$

But this is the mgf of a normal or Gaussian variable with the parameters $n\mu$ and $n\sigma^2$.
If the corresponding random variable is denoted by u then $u \sim N(n\mu, n\sigma^2)$. This is a
property associated with "infinite divisibility" of a random variable, which will be
discussed after considering independence of random variables in the next chapter.
Observe that if $M_x(t)$ is the mgf of x then $[M_x(t)]^n$ does not mean the mgf of $x^n$:

$$[M_x(t)]^n \ne M_{x^n}(t).$$

We have seen the infinite divisibility property for a gamma random variable also. If z
is a two-parameter gamma random variable, then we have seen that its mgf, denoted
by $M_z(t)$, is $(1 - \beta t)^{-\alpha}$. Therefore,

$$[M_z(t)]^n = \left[(1 - \beta t)^{-\alpha}\right]^n = (1 - \beta t)^{-n\alpha},$$

which is the mgf of a gamma variable with the shape parameter $n\alpha$ and scale
parameter $\beta$.


6.8 Transformation of variables 


Here, we consider the problem of finding the probability or density of a function g(x) 
of a random variable x, given the probability or density function of the random vari- 
able x. As an example, suppose that we know that the waiting time at a certain queue 
is exponentially distributed with the expected waiting time one hour. Suppose that 
for a person waiting at this queue it costs him Rs 500 per hour of time lost plus the 
transportation cost of Rs 40. This means, if t is the actual waiting time then his loss 
is g(t) = 40 + 500t. Knowing the distribution of t [known to be exponential here] we 
want the distribution of 40 + 500t. As another example, suppose that a working girl is 
appearing for an interview for promotion. If x is the number of correct answers given, 
then her salary is likely to be $x^2$ + Rs 2000 (fringe benefits at the next position). Here,
x is a binomial random variable and we want the distribution of $y = 2000 + x^2$. Problems
of this type will be examined here. First, let us examine discrete cases. If the




probability function of x is given and if we need the probability function of y = g(x), 
some function of x, then the problem is answered if we can compute the probability 
for each value g(x) takes, knowing the probability function of x. Substitute the values 
of x, for which there are non-zero probabilities, into g(x) and evaluate the correspond- 
ing probabilities. Then we have the answer. This will be illustrated with an example. 


Example 6.6. Suppose x has the probability function: 


$$f(x) = \begin{cases} 0.25, & x = -1 \\ 0.25, & x = 1 \\ 0.5, & x = 0 \\ 0, & \text{elsewhere.} \end{cases}$$

Compute the probability function of (i) $y = x^2$; (ii) $y = 3 + 2x + 5x^2$.


Solution 6.6. (i) When x = -1, $y = x^2 = 1$ with probability 0.25. When x = 1, $y = x^2 = 1$
with probability 0.25. No other x-value gives y = 1. Hence the probability associated
with y = 1 is 0.25 + 0.25 = 0.5. When x = 0, $y = x^2 = 0$ with probability 0.5. This completes
the computations and hence the probability function of y, denoted by $h_1(y)$, is
the following:

$$h_1(y) = \begin{cases} 0.5, & y = 1 \\ 0.5, & y = 0 \\ 0, & \text{elsewhere.} \end{cases}$$

For (ii) also the procedure is the same. For x = -1, $y = 3 + 2x + 5x^2 = 3 + 2(-1) + 5(-1)^2 = 6$
with probability 0.25. When x = 1, $y = 3 + 2x + 5x^2 = 3 + 2(1) + 5(1)^2 = 10$ with probability
0.25. When x = 0, $y = 3 + 2x + 5x^2 = 3 + 2(0) + 5(0)^2 = 3$ with probability 0.5. Hence
the probability function of y, denoted by $h_2(y)$, is the following:

$$h_2(y) = \begin{cases} 0.25, & y = 6 \\ 0.25, & y = 10 \\ 0.5, & y = 3 \\ 0, & \text{elsewhere.} \end{cases}$$


Whatever be the function g(x) of x the procedure is the same in the discrete case. 
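
The discrete procedure (substitute each x with non-zero probability into g and add up the probabilities of x-values that land on the same y) can be sketched as follows. The dictionary representation of the probability function and the name `transform_pmf` are assumptions made for illustration.

```python
from collections import defaultdict

def transform_pmf(pmf, g):
    """Probability function of y = g(x), given the probability function of x.

    pmf maps each x-value to its probability; probabilities of x-values
    that give the same y are added together.
    """
    out = defaultdict(float)
    for x, p in pmf.items():
        if p > 0:
            out[g(x)] += p
    return dict(out)

# Probability function of Example 6.6
f = {-1: 0.25, 1: 0.25, 0: 0.5}
h1 = transform_pmf(f, lambda x: x ** 2)                  # (i)  y = x^2
h2 = transform_pmf(f, lambda x: 3 + 2 * x + 5 * x ** 2)  # (ii) y = 3 + 2x + 5x^2
```

Here `h1` assigns probability 0.5 to each of y = 1 and y = 0, and `h2` assigns 0.25, 0.25 and 0.5 to y = 6, 10 and 3, as in Solution 6.6.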


For the continuous case, the procedure is different. We have to look into the Jaco- 
bian of the transformation, which means that we need a one-to-one function. Let x be 
a continuous random variable and let g(x) be a one-to-one function of x over a given 
interval. Some situations are shown in Figure 6.9. In (a), we have an increasing func- 
tion, which is one-to-one. In (b), we have a decreasing function, which is one-to-one. 




Figure 6.9: (a) increasing, (b) decreasing, (c) increasing/decreasing in each sub-interval.


In (c), we can subdivide the interval into two pieces where in each piece the function 
is one-to-one, and hence we can apply the procedure for each piece separately. 

Let the distribution functions (cumulative density) of x and y = g(x) be denoted
by $F_x(x)$ and $F_y(y)$, respectively. Then

$$\Pr\{x \le a\} = F_x(a) \quad \text{and} \quad \Pr\{y \le b\} = F_y(b).$$

If the function y = g(x) is increasing as in Figure 6.9 (a), then as the point on the x-axis
moves to the right or as x increases the corresponding point on the y-axis moves up or
y also increases. In this case,

$$\Pr\{x \le a\} = \Pr\{y \le b = g(a)\} = \Pr\{y \le g(a)\} \Rightarrow F_x(a) = F_y(g(a)). \tag{6.35}$$

We can differentiate to get the density function. Observe that $\frac{d}{dx}F_x(x) = f_1(x)$ where
$f_1(x)$ is the density function of x and $\frac{d}{dy}F_y(y) = f_2(y)$ where $f_2(y)$ is the density of y. Let
us differentiate (6.35) on both sides with respect to a. Then on the left side we should
get the density of x evaluated at x = a. That is,

$$\begin{aligned}
f_1(a) = \frac{d}{da}F_x(a) &= \frac{d}{da}F_y(g(a)) \\
&= \frac{d}{dg}F_y(g(a)) \times \frac{d}{da}g(a) \Rightarrow \\
f_1(a) &= f_2(g(a))\,g'(a) \Rightarrow \\
f_1(x) &= f_2(y) \times \frac{dy}{dx}.
\end{aligned} \tag{6.36}$$


This is the connection between the densities of x and y when y is an increasing func- 
tion of x. Now, let us see what happens if y is a decreasing function of x as in Fig- 
ure 6.9 (b). Observe that when a point is moving to the right on the x-axis (increasing) 
the corresponding point on the y-axis is decreasing. Hence the connection between 
the probabilities is the following: 


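$$\Pr\{x \le a\} = \Pr\{y \ge b = g(a)\} = 1 - \Pr\{y \le g(a)\} \Rightarrow F_x(a) = 1 - F_y(g(a)).$$

Differentiating both sides with respect to a we have

$$\begin{aligned}
f_1(a) = \frac{d}{da}F_x(a) &= -\frac{d}{da}F_y(g(a)) \\
&= -\frac{d}{dg}F_y(g(a)) \times \frac{d}{da}g(a) \Rightarrow \\
f_1(a) &= -f_2(g(a))\,g'(a) \Rightarrow \\
f_1(x) &= -f_2(y) \times \frac{dy}{dx}.
\end{aligned} \tag{6.37}$$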


Thus, when y is a decreasing function of x then the formula is (6.37). Note that when
y is decreasing, $\frac{dy}{dx}$ will be negative, and thus the right side will be positive all the
time. Thus, in general, the formula is (6.36), and when y is decreasing then multiply
the right side by -1. In practical computations, this minus sign is automatically taken
care of in the limits of integration and hence the formula to be remembered is (6.36),
used with the correct limits of integration.
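
Formulas (6.36) and (6.37) can be combined into the single rule $f_2(y) = f_1(x)/|dy/dx|$ evaluated at $x = g^{-1}(y)$. The sketch below applies it to a decreasing function; the helper `transformed_density` and the choice of example (x exponential with mean 1 and $y = e^{-x}$) are illustrative assumptions, not taken from the text.

```python
from math import exp, log

def transformed_density(f1, g_inverse, dg_dx):
    """Density of y = g(x) for a one-to-one g.

    Uses f2(y) = f1(x) / |dy/dx| at x = g^{-1}(y), which covers both the
    increasing case (6.36) and the decreasing case (6.37).
    """
    def f2(y):
        x = g_inverse(y)
        return f1(x) / abs(dg_dx(x))
    return f2

# Illustrative decreasing case: x exponential with mean 1, y = exp(-x), 0 < y <= 1
f1 = lambda x: exp(-x) if x >= 0 else 0.0
f2 = transformed_density(f1, g_inverse=lambda y: -log(y), dg_dx=lambda x: -exp(-x))
```

In this example `f2(y)` equals 1 for every y in (0, 1], so y is uniformly distributed there, even though $dy/dx$ is negative throughout.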


Example 6.7. Let x be a continuous random variable with density $f_1(x)$ and distribution
function $F_x(x)$, which is a function of x. Consider the transformation $y = F_x(x)$.
[This transformation is called the probability integral transformation, which is the basis
for the area called statistical simulation and also the basis for generating random
numbers or taking a random sample from a given distribution.] Evaluate the density
of y.


Solution 6.7. This is a one-to-one transformation. y is an increasing (non-decreasing)
function of x. Applying the formula (6.36), we have

$$f_1(x) = f_2(y) \times \frac{d}{dx}F_x(x) = f_2(y)\,f_1(x) \Rightarrow 1 = f_2(y).$$


That is, y is uniformly distributed on the interval [0,1]. Thus, through a probability 
integral transformation any density can be brought to a uniform density over [0, 1]. 
This is the importance of this transformation. 
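
Run in reverse, the probability integral transformation gives a way to generate a sample from a given distribution: draw y uniformly on [0, 1] and solve $F_x(x) = y$ for x. The sketch below does this for an exponential density with a given mean, for which $F_x(x) = 1 - e^{-x/\text{mean}}$; the function name `sample_exponential`, the fixed seed and the sample size are assumptions made for illustration.

```python
import random
from math import log

def sample_exponential(mean, n, seed=0):
    """Draw n values from an exponential density with the given mean.

    Inverts y = F(x) = 1 - exp(-x/mean), with y uniform on [0, 1).
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    samples = []
    for _ in range(n):
        u = 1.0 - rng.random()          # u = 1 - y lies in (0, 1], so log(u) is defined
        samples.append(-mean * log(u))  # x = -mean * log(1 - y)
    return samples

xs = sample_exponential(mean=5.0, n=50000)
sample_mean = sum(xs) / len(xs)
```

For a sample of this size the sample mean should be close to the mean value 5 used here.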


Example 6.8. Let x be exponentially distributed with mean value 5. Let y = 2 + 3x. 
Compute the density of y. 


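Solution 6.8. When x goes from 0 to $\infty$, y = 2 + 3x is an increasing function of x.
Hence we can use the formula (6.36). Here $\frac{dy}{dx} = 3$ and $x = \frac{y-2}{3}$. Also when x = 0, y = 2 and
when $x \to \infty$, $y \to \infty$. Therefore,

$$f_1(x) = f_2(y) \times 3 \Rightarrow f_2(y) = \frac{1}{3}f_1(x) = \frac{1}{3}\cdot\frac{1}{5}\,e^{-\frac{1}{5}\left(\frac{y-2}{3}\right)},$$

that is,

$$f_2(y) = \begin{cases} \dfrac{1}{15}\,e^{-\frac{(y-2)}{15}}, & 2 \le y < \infty \\ 0, & \text{elsewhere.} \end{cases}$$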


Thus y is a relocated re-scaled exponential random variable. 


Example 6.9. Let x be a standard normal variable and let $y = x^2$. Compute the density
of y.


Solution 6.9. Here, x goes from $-\infty$ to $\infty$, and hence $y = x^2$ is not a one-to-one function
in the whole range. But in the interval $-\infty < x < 0$ the function is strictly decreasing,
and in the interval $0 \le x < \infty$ the function is strictly increasing, so that
$y = x^2$ is one-to-one in each of these intervals. Hence we can apply formula (6.36) in
the interval $0 \le x < \infty$ and (6.37) in the other interval $-\infty < x < 0$. The curve $y = x^2$ is
shown in Figure 6.10.


Figure 6.10: $y = x^2$.


For positive x, $y = x^2 \Rightarrow x = y^{\frac{1}{2}} \Rightarrow \frac{dx}{dy} = \frac{1}{2}y^{\frac{1}{2}-1}$. Let the piece of the standard normal
density in the interval $0 \le x < \infty$ be denoted by $f_{11}(x)$, that is,

$$f_{11}(x) = \begin{cases} \dfrac{1}{\sqrt{2\pi}}\,e^{-\frac{x^2}{2}}, & 0 \le x < \infty \\ 0, & \text{elsewhere} \end{cases}$$

so that $f_1(x) = f_{11}(x) + f_{12}(x)$, where $f_{12}(x)$ is the corresponding piece of the N(0,1) density
over $-\infty < x < 0$, and let the corresponding piece of the density of y be denoted by $f_{21}(y)$.
Then from (6.36)

$$f_{21}(y) = f_{11}(x)\,\frac{dx}{dy} = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{y}{2}} \cdot \frac{1}{2}\,y^{\frac{1}{2}-1}
= \frac{1}{2}\cdot\frac{1}{2^{\frac{1}{2}}\Gamma\!\left(\frac{1}{2}\right)}\,y^{\frac{1}{2}-1}e^{-\frac{y}{2}}, \quad 0 \le y < \infty.$$




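From the symmetry of y and due to $f_1(x)$ being an even function, $f_{22}(y)$ corresponding
to $f_{12}(x)$ also gives exactly the same function from (6.37). Hence

$$f_2(y) = f_{21}(y) + f_{22}(y) = \begin{cases} \dfrac{1}{2^{\frac{1}{2}}\Gamma\!\left(\frac{1}{2}\right)}\,y^{\frac{1}{2}-1}e^{-\frac{y}{2}}, & 0 \le y < \infty \\ 0, & \text{elsewhere.} \end{cases}$$

But this is a gamma density with parameters $\alpha = \frac{1}{2}$ and $\beta = 2$, or it is a chi-square density
with one degree of freedom; see Section 6.3.2.


Result 6.2. When x is standard normal, then $y = x^2$ is a chi-square with one degree
of freedom or

$$x \sim N(0, 1) \Rightarrow y = x^2 \sim \chi^2_1.$$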


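Another method of showing that when $x \sim N(0, 1)$ then $y = x^2 \sim \chi^2_1$, a chi-square
variable with one degree of freedom, is to use the distribution function of y itself.
The distribution function of y, denoted by $F_y(z) = \Pr\{y \le z\}$, $z > 0$, is such that
$g_y(z) = \frac{d}{dz}F_y(z)$ where $g_y(z)$ is the density of y, evaluated at y = z. Note that

$$\begin{aligned}
F_y(z) = \Pr\{y \le z\} &= \Pr\{x^2 \le z\} = \Pr\{|x| \le \sqrt{z}\} \\
&= \Pr\{-\sqrt{z} \le x \le \sqrt{z}\} = F_x(\sqrt{z}) - F_x(-\sqrt{z})
\end{aligned}$$

where $F_x(\cdot)$ is the distribution function of x, and the density of x is denoted by
$f_x(z) = \frac{d}{dz}F_x(z)$. Therefore, differentiating the above with respect to z we have

$$\begin{aligned}
g_y(z) = \frac{d}{dz}F_y(z) &= \frac{d}{dz}\left[F_x(\sqrt{z}) - F_x(-\sqrt{z})\right] \\
&= f_x(\sqrt{z})\,\frac{1}{2}z^{\frac{1}{2}-1} + f_x(-\sqrt{z})\,\frac{1}{2}z^{\frac{1}{2}-1} \\
&= \frac{1}{2^{\frac{1}{2}}\Gamma\!\left(\frac{1}{2}\right)}\,z^{\frac{1}{2}-1}e^{-\frac{z}{2}}, \quad 0 \le z < \infty
\end{aligned}$$

and zero elsewhere, which is the density of a chi-square random variable with one
degree of freedom or a gamma variable with the parameters ($\alpha = \frac{1}{2}$, $\beta = 2$).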
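
The distribution-function argument can be checked numerically. For $x \sim N(0, 1)$, $F_x(\sqrt{z}) - F_x(-\sqrt{z}) = \operatorname{erf}(\sqrt{z/2})$, and its derivative should agree with the chi-square density with one degree of freedom. The sketch below uses a central difference; the helper names and the step size `h` are illustrative assumptions.

```python
from math import erf, exp, pi, sqrt

def chi2_1_cdf(z):
    """Pr{x^2 <= z} = F_x(sqrt z) - F_x(-sqrt z) = erf(sqrt(z/2)) for x ~ N(0, 1)."""
    return erf(sqrt(z / 2.0)) if z > 0 else 0.0

def chi2_1_density(z):
    """z^(1/2 - 1) * exp(-z/2) / (2^(1/2) * Gamma(1/2)), using Gamma(1/2) = sqrt(pi)."""
    return z ** -0.5 * exp(-z / 2.0) / sqrt(2.0 * pi) if z > 0 else 0.0

def numerical_derivative(func, z, h=1e-5):
    """Central-difference approximation to the derivative of func at z."""
    return (func(z + h) - func(z - h)) / (2.0 * h)

z = 1.5
approx = numerical_derivative(chi2_1_cdf, z)
exact = chi2_1_density(z)
```

At interior points such as z = 1.5 the two values agree to within the accuracy of the finite difference.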


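Example 6.10. Let $x \sim N(0, 1)$ and let $y = 5x^2 - 3$. Compute the density of y.


Solution 6.10. Let $u = x^2$. Then we have from Example 6.9 that u is a chi-square variable
with one degree of freedom. Let the density of u be $f_1(u)$. Then

$$f_1(u) = \begin{cases} \dfrac{1}{2^{\frac{1}{2}}\Gamma\!\left(\frac{1}{2}\right)}\,u^{\frac{1}{2}-1}e^{-\frac{u}{2}}, & 0 \le u < \infty \\ 0, & \text{elsewhere.} \end{cases}$$

But y to u is a one-to-one transformation: $y = 5u - 3 \Rightarrow u = \frac{y+3}{5}$, $dy = 5\,du$, and $0 \le u < \infty \Rightarrow
-3 \le y < \infty$. Therefore, if the density of y is denoted by $f_2(y)$ then

$$f_2(y) = \begin{cases} \dfrac{1}{5\,(2^{\frac{1}{2}})\,\Gamma\!\left(\frac{1}{2}\right)}\left(\dfrac{y+3}{5}\right)^{\frac{1}{2}-1}e^{-\frac{(y+3)}{10}}, & -3 \le y < \infty \\ 0, & \text{elsewhere.} \end{cases}$$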


6.9 A note on skewness and kurtosis 


Skewness is often misinterpreted as asymmetry in a distribution. Skewness is associated
with the median. In a continuous case, $\Pr\{x \le M\} = \Pr\{x \ge M\} = \frac{1}{2}$, or the probability
to the left of the median point M is the same as the probability to the right of M. If the
range of x to the right of M is not equal to the range of x to the left of M, then there is 
possibility of skewness. If the probability 0.5 on one side of the point M is stretched 
out compared to the other side, then the density is skewed to the stretched outer side. 
A density curve can be asymmetric but need not be skewed. Some possibilities are 
marked in Figure 6.11. 


Figure 6.11: Symmetric, asymmetric, skewed to right, skewed to left densities.


A scale-free measure based on the third central moment, such as

$$s = \frac{\mu_3}{(\mu_2)^{\frac{3}{2}}} \tag{6.38}$$

where $\mu_3 = E[x - E(x)]^3$, $\mu_2 = E[x - E(x)]^2 = \sigma^2$, is often used to measure skewness. But
s can only measure asymmetry rather than skewness. But

$$s_1 = \frac{E[x - M]^3}{\left\{E[x - M]^2\right\}^{\frac{3}{2}}} \tag{6.39}$$

can measure skewness to some extent, where M is the median. If $s_1 > 0$, then one may
say that the density is skewed to the right and if $s_1 < 0$ then skewed to the left. We
shall not elaborate on this aspect further because the measures s and $s_1$ are not unique
properties associated with any shape. Kurtosis has something to do with peakedness
or flatness of a density curve or probability function. When we say more flat or more
peaked then there has to be a standard item to compare with. The normal (Gaussian)
density curve is taken as the standard for comparison purposes. For a Gaussian density
when we compute the ratio,

$$k = \frac{E[x - E(x)]^4}{\left[E(x - E(x))^2\right]^2} = \frac{\mu_4}{\mu_2^2} \tag{6.40}$$

then it is k = 3 for the Gaussian case. Hence the comparison is made with this number 3.
For a given distribution if k > 3, then we say that the distribution is lepto-kurtic
(more peaked) and if k < 3, then we say that the distribution is plati-kurtic (more flat)
as shown in Figure 6.12.


Figure 6.12: Left: normal; Middle: lepto-kurtic; Right: plati-kurtic.
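
The moment-based measures $s = \mu_3/\mu_2^{3/2}$ and $k = \mu_4/\mu_2^2$ are straightforward to compute for a discrete probability function. The sketch below is illustrative only: the dictionary representation and the function names are assumptions, and the symmetric three-point distribution of Example 6.6 is reused as input.

```python
def central_moment(pmf, r):
    """r-th central moment E[x - E(x)]^r of a discrete probability function."""
    mean = sum(x * p for x, p in pmf.items())
    return sum((x - mean) ** r * p for x, p in pmf.items())

def skewness_and_kurtosis(pmf):
    """Return (s, k) with s = mu_3 / mu_2^(3/2) and k = mu_4 / mu_2^2."""
    mu2 = central_moment(pmf, 2)
    mu3 = central_moment(pmf, 3)
    mu4 = central_moment(pmf, 4)
    return mu3 / mu2 ** 1.5, mu4 / mu2 ** 2

# Symmetric three-point probability function (values -1, 0, 1)
s, k = skewness_and_kurtosis({-1: 0.25, 0: 0.5, 1: 0.25})
```

For this input s = 0, reflecting the symmetry, and k = 2, which is below the Gaussian reference value 3.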


But the items s, $s_1$, k given above are not characterizing quantities for distributions. In
other words, these measures do not uniquely determine (characterize) distributions
or shapes. Hence no unique conclusions can be made by using these measures. We
have already seen that a property such as mean value $\mu$ being equal to variance $\sigma^2$
is enjoyed by the Poisson random variable as well as by a gamma random variable
with the scale parameter $\beta = 1$. Hence it is not a characteristic property of any random
variable. Similarly, for a Gaussian random variable k in (6.40) is 3. But k = 3 is not
a characteristic property of the Gaussian density. Hence the concepts behind it and
comparison with 3 do not have much significance. Hence, nowadays, the students are
unlikely to find discussion of skewness and kurtosis in modern probability/statistics
books.


Note 6.2. In Chapter 5, we derived most of the discrete probability functions by look- 
ing at experimental situations satisfying some conditions. When it came to densities, 
for continuous random variables, we could not list experimental situations, other than 
for the case of uniform distribution. Are there experimental situations from where densities
could be derived? The answer is in the affirmative. The derivation of the densities
from the basic assumptions to the final densities involves the concepts of joint distributions,
statistical independence, etc. of random variables. Hence we will consider such
problems after discussing joint distributions. 




6.10 Mathai's pathway model 


A very general density with a switching mechanism, introduced by Mathai [5] has the 
following form in the particular case of real scalar variables: [The above paper [5] is 
on rectangular matrix variate functions.] 


$$f(x) = c\,|x|^{\gamma-1}\left[1 - a(1-q)|x|^{\delta}\right]^{\frac{\eta}{1-q}} \tag{6.41}$$

for $a > 0$, $\delta > 0$, $\eta > 0$, $\gamma > 0$, $-\infty < x < \infty$, and $1 - a(1-q)|x|^{\delta} > 0$, and zero elsewhere,
where c is the normalizing constant. A particular case of (6.41) for x > 0 is the following:

$$g_1(x) = c_1\,x^{\gamma-1}\left[1 - a(1-q)x^{\delta}\right]^{\frac{\eta}{1-q}} \tag{6.42}$$

for $a > 0$, $\delta > 0$, $\gamma > 0$, $\eta > 0$, $1 - a(1-q)x^{\delta} > 0$. Here, x will be in a finite range with
a non-zero function for q < 1. Observe that when q < 1 then the density in (6.42) stays
in the generalized type-1 beta family. Generalized type-1 beta in the sense that if y is
type-1 beta as described in Section 6.5 then consider a transformation $y = a(1-q)x^{\delta}$;
then the density of x will reduce to the form in (6.42).

When q > 1, then 1 - q = -(q - 1) and the density, denoted by $g_2(x)$, has the form

$$g_2(x) = c_2\,x^{\gamma-1}\left[1 + a(q-1)x^{\delta}\right]^{-\frac{\eta}{q-1}} \tag{6.43}$$

for $0 \le x < \infty$, $a > 0$, $\delta > 0$, $\eta > 0$, $q > 1$, $\gamma > 0$, and zero elsewhere, where $c_2$ is the
normalizing constant. The form in (6.43) is a generalized type-2 beta family in the sense that
if y is type-2 beta as in Section 6.5 then consider a transformation $y = a(q-1)x^{\delta}$; then x
will have the density of the form in (6.43).

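Now consider the limiting process of q going to 1 either from the left or from the
right. From the property, coming from the definition of the mathematical constant e,
that

$$\lim_{n \to \infty}\left(1 + \frac{x}{n}\right)^n = e^x,$$

we have

$$\lim_{q \to 1_-} g_1(x) = \lim_{q \to 1_+} g_2(x) = g_3(x)$$

where

$$g_3(x) = c_3\,x^{\gamma-1}e^{-a\eta x^{\delta}} \tag{6.44}$$

for $0 \le x < \infty$, $a > 0$, $\eta > 0$, $\delta, \gamma > 0$ and zero elsewhere. This is the generalized gamma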
density. In the pathway model, q is called the pathway parameter because through q 
one can go from a generalized type-1 beta family of densities to a generalized type-2 




beta family of densities to a generalized gamma family. Thus all these families of functions
are connected through this pathway parameter q. By making the substitution
$y = a(1-q)x^{\delta}$ in (6.42), $y = a(q-1)x^{\delta}$ in (6.43) and $y = ax^{\delta}$ in (6.44) and then integrating
out by using type-1 beta, type-2 beta and gamma integrals respectively, one can
compute the normalizing constants $c_1$, $c_2$, $c_3$. [This evaluation is given as an exercise
to the students.]
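
The switching behaviour of the pathway parameter can be seen numerically: the factor $[1 - a(1-q)x^{\delta}]^{\eta/(1-q)}$ approaches $e^{-a\eta x^{\delta}}$ as q tends to 1 from either side. The sketch below evaluates only this factor, without the normalizing constants; the function name `pathway_kernel` and the parameter values are illustrative assumptions.

```python
from math import exp

def pathway_kernel(x, q, a, delta, eta):
    """Bracketed factor of (6.42) and (6.43), without x^(gamma-1) or the constant.

    For q < 1 it is [1 - a(1-q)x^delta]^(eta/(1-q)) on the finite support,
    for q > 1 the same expression equals [1 + a(q-1)x^delta]^(-eta/(q-1)),
    and for q = 1 the exponential limit exp(-a*eta*x^delta) is returned.
    """
    if q == 1.0:
        return exp(-a * eta * x ** delta)
    base = 1.0 - a * (1.0 - q) * x ** delta
    if base <= 0.0:
        return 0.0  # outside the support when q < 1
    return base ** (eta / (1.0 - q))

x, a, delta, eta = 1.3, 0.7, 2.0, 1.5   # assumed illustrative values
limit = pathway_kernel(x, 1.0, a, delta, eta)
from_left = pathway_kernel(x, 1.0 - 1e-6, a, delta, eta)
from_right = pathway_kernel(x, 1.0 + 1e-6, a, delta, eta)
```

Both `from_left` and `from_right` agree with `limit` to several decimal places, which is the numerical counterpart of the limit leading to (6.44).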


6.10.1 Logistic model 


A very popular density in industrial applications is the logistic model or logistic density.
Let us denote it by f(x).

$$f(x) = \frac{e^{x}}{(1 + e^{x})^2} = \frac{e^{-x}}{(1 + e^{-x})^2}, \quad -\infty < x < \infty. \tag{6.45}$$


Note that the shape of the curve corresponds to a Gaussian density but with thicker 
tails at both ends. In situations where the tail probabilities are bigger than the corre- 
sponding areas from a standard normal density, then this logistic model is used. We 
will look at some interesting connections to other densities. Let us consider a type-2 
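beta density with the parameters $\alpha > 0$ and $\beta > 0$, that is, with the density

$$g(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,x^{\alpha-1}(1+x)^{-(\alpha+\beta)}, \quad 0 \le x < \infty,\ \alpha > 0,\ \beta > 0 \tag{6.46}$$

and zero elsewhere. Let us make the transformation $x = e^{y}$ (or $x = e^{-y}$); then $-\infty < y < \infty$
and the density of y, denoted by $g_1(y)$ and written here for $x = e^{y}$, is given by

$$g_1(y) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\frac{e^{\alpha y}}{(1 + e^{y})^{\alpha+\beta}}
= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\frac{e^{-\beta y}}{(1 + e^{-y})^{\alpha+\beta}} \tag{6.47}$$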

for $\alpha > 0$, $\beta > 0$, $-\infty < y < \infty$. Note that for $\alpha = 1$, $\beta = 1$, (6.47) reduces to (6.45), the
logistic density. This (6.47) is the generalized logistic density introduced by this author
and his co-workers and available from a type-2 beta density by a simple transformation.
If we put $x = \frac{m}{n}z$, $\alpha = \frac{m}{2}$, $\beta = \frac{n}{2}$ in (6.46), then the density of z is the F-density
or variance ratio density. If a power transformation is used in (6.46), that is, if we replace
x by $at^{\rho}$, $\rho > 0$, $a > 0$, then (6.46) will lead to a particular case of the pathway
density in Section 6.10 for q > 1. For all the models described in Sections 6.1 to 6.11,
one can look at power transformations and exponentiation. That is, replace x by $ay^{\rho}$,
$a > 0$, $\rho > 0$ or x by $e^{-y}$; then we end up with very interesting models which are useful
when models are constructed for given data; see the effects of such transformations
in [7]. There are other classes of densities associated with Mittag-Leffler functions.
These functions naturally arise in fractional calculus, especially in the solutions of
fractional differential equations. Such models may be seen from [6].
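
The reduction of the generalized logistic density (6.47) to the logistic density (6.45) for $\alpha = \beta = 1$, and the equality of the two forms given in (6.47), can be verified pointwise. The sketch below is illustrative; the function names are hypothetical and the first form of (6.47) is the one coded.

```python
from math import exp, gamma

def generalized_logistic(y, alpha, beta):
    """Density (6.47): Gamma(a+b)/(Gamma(a)Gamma(b)) * e^(alpha*y) / (1 + e^y)^(alpha+beta)."""
    const = gamma(alpha + beta) / (gamma(alpha) * gamma(beta))
    return const * exp(alpha * y) / (1.0 + exp(y)) ** (alpha + beta)

def logistic(x):
    """Logistic density (6.45): e^x / (1 + e^x)^2."""
    return exp(x) / (1.0 + exp(x)) ** 2

# alpha = beta = 1 should reproduce the logistic density
diffs = [abs(generalized_logistic(t, 1.0, 1.0) - logistic(t)) for t in (-2.0, 0.0, 0.7, 3.0)]

# Second form of (6.47), checked at assumed values alpha = 2, beta = 3, y = 0.7
a_, b_, y_ = 2.0, 3.0, 0.7
second_form = gamma(a_ + b_) / (gamma(a_) * gamma(b_)) * exp(-b_ * y_) / (1.0 + exp(-y_)) ** (a_ + b_)
```

All entries of `diffs` are zero up to rounding, and `second_form` agrees with `generalized_logistic(0.7, 2.0, 3.0)`.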




6.11 Some more commonly used density functions 


In the following list, only the non-zero parts of the densities are given. It is understood
that the functions are zero outside the ranges listed therein.


fux) = ἡ sin vx, exei 
π 
[Arc-sine density]. 


$$f_{12}(\theta) = \frac{2\,\Gamma(p_1 + p_2)}{\Gamma(p_1)\Gamma(p_2)}\,(\sin\theta)^{2p_1-1}(\cos\theta)^{2p_2-1}$$

for $0 < \theta < \pi/2$, $p_1 > 0$, $p_2 > 0$ [Beta type-1 polar density].

2 T x suis E 
fg) = x -- 
forO<x<s,m>0,n>0, 


B I(s(QQm+n+s +1) 
S 5 (2m+s+1) τί semen yr snis ) 


[Beta type-1, three-parameter density]. 


I(a +B) 
Γ(α)Γ(β) 
2 


xa (asa) 


vi 
e$ xt1(1- xf 


figo = 


for0<x<1,A>0,a>0, B > O and ,F, is a confluent hypergeometric function [Beta 
type-1 non-central density]. 


. s(2m+2n+s+1) cj 
2: 


s(2m+s+1) x 
Πεία)-οχ 2 x + 3 


for0xx«oco,s»0,m»0,n»0, 


υπ. 4 1) 


s(2m+s+1) 


s 3. P(X) yr(sn +1) 
[Beta type-2 three-parameter density]. 


pee Γία -- β) 
ΠΟΠ; 
2 


ÀA ox 
ZA ια Έτη) 
ΓΡ 2 14x 


xl) wp 


for O < x < œ, a > 0, B>0,A>0 [Beta type2 non-central density]. 




Γ(α--β) xp? 
Γ(α)Γ(β) (x + b)'*? 


fig(x) = 
for a > 0, B>0, b > 0, 0 < x < oo [Beta type2 inverted density]. 
Ποία) = [27o (k) J !ekcos 26-5 


for O0<x <7, 0<ß <n, 0< k< o0 where Ij(k) is a Bessel function [Circular normal 
density; also see bimodal density]. 


(1-p)? 


Πο) = 2n(1- p sin 2x) 


for p? «1, 0 < x <7 [Bimodal density]. 


1 


fnt = c[-1 + exp(a + Bx)] 


e" .) [Bose-Einstein density]. 


for 0 «x «co, $> 0, ef >1, c= gIn4 


A 


fn) = n[A + (x - n3] 


for -oo < x «oo, A> 0, -oo«g < oo [Cauchy density]. 


(1-07) 
2n[1 + 0? - 20 cosx] 


fg) = 


for 0 <x <27, 0<o<1[Cauchy wrapped up density]. 


A) gi 
for 0 < x < co [Cauchy folded density]. 
25 ) x" 1 zu 
fos) = or) Geet 


for x 20,0 > 0, na positive integer [Chi density]. 


x 


po Ὁ 2 LX 
fag(x) = @ 27 > =( a pn Sz 
5r! 


20? I(E +r) 


for x > 0, -co < u < co, 0 > 0, ka positive integer [Non-central chi-square density; u = 0 
gives chi-square density; see also gamma density]. 


PA nn 


for x 20, a 20, p a positive integer [Erlang density; see also gamma density]. 


aie 


fy) = 




a 
(x)= fe ?l, o5 «x < 00, a>0 
28 2 


[Exponential — bilateral density or Laplace density ; with x replaced by x - c we get 
exponential-double density]. 


fog (x) = a* (a 4 1)? [r(p)] xe exp[-(a + 1)x] F,(a; p; x) 


for 0 < x < œo, a > 0, p > 0, a > 0 [Exponential - generalized density]. 


—X 


fv) = ar 


a_ e-b) 


fora <x < b, a > 0, b > 0 [Exponential - truncated density]. 
$$f_{31}(x)=\frac{1}{\beta}\,e^{-\frac{1}{\beta}(x-\lambda)}\quad\text{for } \lambda\le x<\infty,$$

$0<\beta<\infty$, $-\infty<\lambda<\infty$ [Exponential two-parameter density].


fax) = zely -exp(-y) for -oo <x <œ, 0< B. « oo, 


-0 <À <œ, y= τ [Extreme value, first asymptotic density]. 


ώς Hr) "E 2] 


for x > 0, v, k > 0 [Extreme value, second asymptotic density]. 


Γαία) = LL ) ev 


(-v)\v 


for x <0, v>0, k > 1 [Extreme value, third asymptotic density]. 


μῶν jen (220) ο] 


for x 2 0, A> 0 [Extreme value, modified density]. 


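$$f_{36}(x)=\frac{\Gamma\left(\frac{m+n}{2}\right)}{\Gamma\left(\frac{m}{2}\right)\Gamma\left(\frac{n}{2}\right)}\left(\frac{m}{n}\right)^{\frac{m}{2}}\frac{x^{\frac{m}{2}-1}}{\left(1+\frac{m}{n}x\right)^{\frac{m+n}{2}}}$$

for $x>0$, $m,n$ positive integers [Fisher's F-density].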


fay) = et IS) yr 
m n m+n 2 
MDG) a ey) ER 2:5 2 


for x > 0, A> 0, m,n positive integers, x = ^ y [F-non-central density]. 




$$f_{38}(x)=\left[c\{1+\exp(\alpha+\beta x)\}\right]^{-1}$$

for $0\le x<\infty$, $\alpha\ge 0$, $\beta>0$, $c=\frac{1}{\beta}\ln\left(\frac{1+e^{\alpha}}{e^{\alpha}}\right)$ [Fermi-Dirac density].


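$$f_{39}(\theta)=\frac{k}{2\sinh k}\,e^{k\cos\theta}\sin\theta,$$

for $0<\theta\le\pi$, $k>0$ [Fisher's angular density].

$$f_{40}(x)=\frac{\sqrt{k}}{\sqrt{2\pi}\,(x-a)}\exp\left[-\frac{k}{2}\bigl(b-\ln(x-a)\bigr)^{2}\right]$$

for $k>0$, $a<x<\infty$, $a>0$, $-\infty<b<\infty$ [Galton's density or Log-normal density].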


$$f_{41}(x)=\sum_{j=0}^{\infty}c_j H_j(x)\phi(x)$$

for $-\infty<x<\infty$, where $\phi(x)$ is the standard normal density and $H_j(x)$ is the Chebyshev-Hermite polynomial of degree $j$ in $x$ defined by $(-1)^{j}\frac{d^{j}}{dx^{j}}\phi(x)=H_j(x)\phi(x)$ and $c_j=\frac{1}{j!}\int_{-\infty}^{\infty}H_j(x)f(x)\,dx$ [Gram-Charlier type A density].


(mgx) 


fao) = ( τ e Sn 


for x 20, m > 0, g > 0, T > 0 [Helley's density]. 


1 

5(n-1) n-2 

ων (y κο 
(UU ozod \o 


for 0 < x < œ, 0 > 0, n a positive integer [Helmert density]. 
$$f_{44}(x)=[\pi\cosh x]^{-1}$$
for $-\infty<x<\infty$ [Hyperbolic cosine density].
fa) ΞΡ 
fore? <x <1, p > 0 [Hyperbolic truncated density]. 


$$f_{46}(x)=\left(\frac{\lambda}{2\pi x^{3}}\right)^{\frac{1}{2}}\exp\left[-\frac{\lambda(x-\mu)^{2}}{2\mu^{2}x}\right]$$

for $x>0$, $\lambda,\mu>0$ [Inverse Gaussian density].
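$$f_{47}(x)=\frac{\exp\left[-\frac{1}{\beta}(x-\alpha)\right]}{\beta\left[1+\exp\left(-\frac{1}{\beta}(x-\alpha)\right)\right]^{2}}$$

for $-\infty<x<\infty$, $\beta>0$, $-\infty<\alpha<\infty$ [Logistic density].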


$$f_{48}(x)=\frac{1}{x\sigma\sqrt{2\pi}}\exp\left[-\frac{(\ln x-\mu)^{2}}{2\sigma^{2}}\right]$$

for $x>0$, $\sigma>0$, $-\infty<\mu<\infty$ [Log normal density].




$$f_{49}(x)=\frac{4}{\sqrt{\pi}}\,\beta^{\frac{3}{2}}x^{2}\exp(-\beta x^{2})$$
for $x>0$, $\beta>0$ [Maxwell-Boltzmann density].


$$f_{50}(x)=\frac{\alpha}{x_0}\left(\frac{x_0}{x}\right)^{\alpha+1}$$

for $x>x_0$, $\alpha>0$ [Pareto density].


$$\frac{d}{dx}f_{51}(x)=\frac{(x-a)f_{51}(x)}{b_0+b_1x+b_2x^{2}}$$

where $f_{51}(x)$ is the density function. The explicit solutions are classified into types I-XII according to the nature of the roots of $b_0+b_1x+b_2x^{2}=0$ [Pearson family].


$$f_{52}(x)=\frac{x}{\alpha^{2}}\exp\left[-\frac{1}{2}\left(\frac{x}{\alpha}\right)^{2}\right]$$

for $x\ge 0$, $\alpha>0$ [Rayleigh density].


$$f_{53}(x)=\frac{\beta\exp[\alpha+\beta x]}{[1+\exp(\alpha+\beta x)]^{2}}$$

for $-\infty<x<\infty$, $\beta>0$, $-\infty<\alpha<\infty$ [Sech square density].


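$$f_{54}(x)=\frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)}\left(1+\frac{x^{2}}{\nu}\right)^{-\frac{\nu+1}{2}}$$

for $-\infty<x<\infty$, $\nu$ a positive integer [Student-t density].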
$$f_{55}(x)=\frac{1}{a}\left(1-\frac{|x|}{a}\right)\quad\text{for } |x|\le a,\ a>0$$
[Triangular density; there are several modifications of this density].

$$f_{56}(x)=[2\pi I_0(k)]^{-1}e^{k\cos(x-\beta)}\quad\text{for } 0<x\le 2\pi,$$
$0\le\beta\le 2\pi$, $0\le k<\infty$ where $I_0(k)$ is a Bessel function [Von Mises density].

$$f_{57}(x)=\frac{m}{\theta}(x-a)^{m-1}\exp\left[-\frac{1}{\theta}(x-a)^{m}\right]$$
for $x\ge a$, $\theta>0$, $m>0$, $a>0$ [Weibull three parameter density; for $a=0$, we have the usual Weibull density, which is a special case of generalized gamma].
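As a rough numerical companion to the list above, the following sketch checks that three of the listed forms (Laplace, hyperbolic cosine and Rayleigh) integrate to approximately 1. The helper `integrate`, the parameter values and the truncation of the infinite ranges at 40 are illustrative assumptions, not part of the text.

```python
import math

def integrate(f, a, b, n=200_000):
    """Composite midpoint rule on [a, b] with n equal panels."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def laplace(x, alpha=1.5):
    # (alpha / 2) * exp(-alpha * |x|), alpha chosen arbitrarily for the check
    return 0.5 * alpha * math.exp(-alpha * abs(x))

def hyperbolic_cosine(x):
    # 1 / (pi * cosh x)
    return 1.0 / (math.pi * math.cosh(x))

def rayleigh(x, alpha=2.0):
    # (x / alpha^2) * exp(-x^2 / (2 alpha^2)) on x >= 0
    return (x / alpha**2) * math.exp(-x**2 / (2 * alpha**2))

# Infinite ranges are truncated at +/- 40, where the tails are negligible.
total_laplace = integrate(laplace, -40.0, 40.0)
total_cosh = integrate(hyperbolic_cosine, -40.0, 40.0)
total_rayleigh = integrate(rayleigh, 0.0, 40.0)
print(total_laplace, total_cosh, total_rayleigh)
```

A numerical total close to 1 is only a sanity check; it does not replace the analytical verification.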


Exercises 6 


6.1. Evaluate $E(x)$ and $E(x^2)$ for the uniform density by differentiating the moment generating function in (6.2).




6.2. Obtain $E(x)$, $E(x^2)$, and thereby the variance of the gamma random variable, by using the moment generating function $M(t)$ in (6.14), (i) by expanding $M(t)$; (ii) by differentiating $M(t)$.


6.3. Expand the exponential part in the incomplete gamma integral $\gamma(\alpha; a)$, integrate term by term and obtain the series as a ${}_1F_1$ hypergeometric series.


6.4. Expand the factor $(1-x)^{\beta-1}$ in the incomplete beta integral in (6.27), integrate term by term and obtain a ${}_2F_1$ hypergeometric series.


6.5. Show that the functions $f_1(x)$ to $f_{57}(x)$ given in Section 6.11 are all densities, that is, show that the functions are non-negative and the total integral is 1 in each case.


6.6. For the functions in Exercise 6.5, compute (1) E(x), (2) Var(x); (3) the moment 
generating function of x, whenever these exist. 


6.7. For the functions in Exercise 6.5 compute (1) the Mellin transform; (2) the Laplace transform, wherever they exist and wherever the variable is positive. Give the conditions of existence.


6.8. Let $f(x)$ be a real-valued density function of the real random variable $x$. Let $y$ be another real variable. Consider the functional equation

$$f(x)f(y)=g(x^{2}+y^{2})$$

where $g$ is an arbitrary function. By solving this functional equation, show that $f(x)$ is a Gaussian density with $E(x)=\mu=0$.


6.9. For Exercise 6.8, let $z$ be a real variable and let the functional equation be

$$f(x)f(y)f(z)=g(x^{2}+y^{2}+z^{2}).$$
Show that the solution gives a Gaussian density with $E(x)=\mu=0$.


6.10. Shannon's entropy, which is a measure of uncertainty in a distribution and which has a wide range of applications in many areas, especially in physics, is given by

$$S=-c\int_{x}[\ln f(x)]f(x)\,dx$$

where $c$ is a constant and $f(x)$ is a non-negative integrable function. (1) Show that if $S$ is maximized over all densities under the condition (i) $\int_{x}f(x)\,dx=1$, then the resulting density is a uniform density; (2) show that under the conditions (i) and (ii) $E(x)$ is given or fixed over all functionals $f$, the resulting density is the exponential density; (3) show that under the conditions (i), (ii) and (iii) $E(x^{2})$ = a given quantity, the resulting density is a Gaussian density. [Hint: Use calculus of variations].




6.11. The residual effect x of two opposing forces behaves like small positive or nega- 
tive residual effect having high probabilities and larger residual effect having smaller 
probabilities. A Laplace density is found to be a good model. For a Laplace density 
of the form f(x) = ce Fl Loo < x < oo, where c is the normalizing constant, compute 
(1) $\Pr\{-5<x\le 3\}$; (2) $E(x)$; (3) moment generating function of $x$.


6.12. If $x\sim N(\mu=2,\sigma^{2}=5)$ derive the densities of (1) $y_1=2x$; (2) $y_2=2x+5$; (3) $y_3=3x^{2}$ when $\mu=0$, $\sigma^{2}=1$; (4) $y_4=2+3x^{2}$ when $\mu=0$, $\sigma^{2}=1$, by using transformation of variables.


6.13. If x ~ gamma(a = 3, f = 2) derive the density of (1) y, = Αχ}; (2) y; = 4x? +3 by 
using transformation of variables. 


6.14. Under a probability integral transformation, an exponential variable $x$ with expected value $E(x)=3$ makes $y$ a uniformly distributed variable over [0,1]. What is the $x$ value corresponding to (1) $y=0.2$; (2) $y=0.6$?


6.15. If $M_x(t)$ is the moment generating function of a random variable $x$, is $[M_x(t)]^{\frac{1}{n}}$, $n=2,3,\ldots$ a moment generating function for some random variable? List at least two random variables where this happens, from among the random variables discussed in this chapter.


6.16. Let $x$ be a type-1 beta random variable with parameters $(\alpha,\beta)$. Let (1) $y=\frac{x}{1-x}$; (2) $z=\frac{1-x}{x}$; (3) $u=1-x$. By using transformation of variables, show that $y$ and $z$ are type-2 beta distributed with parameters $(\alpha,\beta)$ and $(\beta,\alpha)$, respectively, and that $u$ is type-1 beta distributed with parameters $(\beta,\alpha)$.


6.17. Let $x$ be a type-2 beta random variable with parameters $(\alpha,\beta)$. Let (1) $y=\frac{x}{1+x}$; (2) $z=\frac{1}{1+x}$; (3) $u=\frac{1}{x}$. By using transformation of variables show that $y$ and $z$ are type-1 beta distributed with the parameters $(\alpha,\beta)$ and $(\beta,\alpha)$, respectively, and that $u$ is type-2 beta distributed with the parameters $(\beta,\alpha)$.


6.18. By using the moment generating function show that if $x$ is normally distributed, that is, $x\sim N(\mu,\sigma^{2})$, then $y=ax+b$ is normally distributed, where $a$ and $b$ are constants. What is the distribution of $y$ when $a=0$?


6.19. If $x$ is binomial with parameters $(n=10, p=0.6)$ and $y$ is Poisson with parameter $\lambda=5$, evaluate the probability functions of $u=2x+3$ and $v=3y-5$.


6.20. Evaluate the probability generating function, $E(t^{x})$, for (1) Poisson probability law; (2) geometric probability law; (3) negative binomial probability law.


6.21. Consider a Poisson arrival of points on a line or occurrence of an event over time according to a Poisson probability law as in (5.16) with the rate of arrival $\alpha$. Let $x$ denote the waiting time for the first occurrence. Then the probability $\Pr\{x>t\}$ = probability that the number of occurrences is zero $=e^{-\alpha t}$, $t>0$. Hence $\Pr\{x\le t\}=1-e^{-\alpha t}$, $t\ge 0$,




which means x has an exponential distribution. If y is the waiting time before the r-th 
occurrence of the event, then show that y has a gamma density with parameter (r, 1) 


6.22. Let a random variable $x$ have density function $f(x)$ and distribution function $F(x)$. Then the hazard rate $\lambda(x)$ is defined as $\lambda(x)=\frac{f(x)}{1-F(x)}$. Compute the hazard rate when $x$ has (i) exponential density; (ii) Weibull density; (iii) logistic density.


7 Joint distributions 


7.1 Introduction 


There are many practical situations where we have pairs of random variables, and examples are plentiful in nature: (x, y) with x = blood pressure of a patient before administering a drug, y = blood pressure after administering that drug; (x, y) with x = height, y = weight of an individual; (x, y) with x = waiting time for an interview, y = the interview time; (x, y) with x = amount of fertilizer applied, y = yield of tapioca, etc. In these examples, both variables x and y are continuous and they have some joint distribution. Now consider the following cases: (x, y) with x = number of questions attempted in an examination, y = number of correct answers; (x, y) with x = the number of local floods, y = number of houses damaged; (x, y) with x = number of monthly traffic accidents on a particular road, y = the number of associated injuries, etc. These are all pairs where both x and y are discrete. Finally, consider situations such as the following: (x, y) with x = number of monthly accidents, y = the amount of compensation paid; (x, y) with x = number of computer breakdowns, y = the duration of working hours lost, etc. These are pairs of variables where one is discrete and the other is continuous. We will consider joint distributions involving pairs of variables first and then extend the theory to joint distributions of many real scalar variables.

We will introduce the functions as mathematical quantities first and then we will 
look into experimental situations where such probability models will be appropriate. 


Definition 7.1 (Joint probability/density function). A function $f(x,y)$ is called a joint probability/density function of the random variables $x$ and $y$ if the following two conditions are satisfied:
(i) $f(x,y)\ge 0$ for all real values of $x$ and $y$;
(ii) $\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}f(x,y)\,dx\wedge dy=1$ if $x$ and $y$ are continuous, where the wedge product of differentials is explained in Module 4;
$\sum_{-\infty<x<\infty}\sum_{-\infty<y<\infty}f(x,y)=1$ if $x$ and $y$ are discrete. (Sum up the discrete variable and integrate the continuous variable in the mixed case.)


Note 7.1. The wedge product or skew symmetric product of differentials is defined by the equation

$$dy\wedge dx=-dx\wedge dy$$

so that $dx\wedge dx=0$. Applications in computing Jacobians of transformations and more details are given in Module 4.


$f(x_1,\ldots,x_k)$ is the joint probability/density function of the random variables $x_1,\ldots,x_k$ if the following conditions are satisfied:


Open Access. © 2018 Arak M. Mathai, Hans J. Haubold, published by De Gruyter. This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
https://doi.org/10.1515/9783110562545-007




(i) $f(x_1,\ldots,x_k)\ge 0$ for all real values of $x_1,\ldots,x_k$;

(ii) $\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}f(x_1,\ldots,x_k)\,dx_1\wedge\cdots\wedge dx_k=1$ if all variables are continuous;
$\sum_{x_1}\cdots\sum_{x_k}f(x_1,\ldots,x_k)=1$ if all variables are discrete. (Sum up the discrete variables and integrate the continuous variables in the mixed case.)


Example 7.1. Check whether the following is a joint probability function: f(x,y) is given in the table for non-zero values and f(x,y) = 0 elsewhere.

          x = 0    x = 1    Sum
y = 1     1/10     1/10     2/10
y = -1    2/10     1/10     3/10
y = 2     3/10     2/10     5/10
Sum       6/10     4/10     1


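This can also be stated as follows:

$$f(x=0,y=1)=f(0,1)=\tfrac{1}{10},\quad f(0,-1)=\tfrac{2}{10},\quad f(0,2)=\tfrac{3}{10},$$
$$f(1,1)=\tfrac{1}{10},\quad f(1,-1)=\tfrac{1}{10},\quad f(1,2)=\tfrac{2}{10}$$

and f(x,y) is zero elsewhere.

This can also be written as

$$\Pr\{x=0,y=1\}=\tfrac{1}{10},\quad \Pr\{x=0,y=-1\}=\tfrac{2}{10},\quad \Pr\{x=0,y=2\}=\tfrac{3}{10},$$
$$\Pr\{x=1,y=1\}=\tfrac{1}{10},\quad \Pr\{x=1,y=-1\}=\tfrac{1}{10},\quad \Pr\{x=1,y=2\}=\tfrac{2}{10}$$

and f(x,y) = 0 for all other x and y.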


Solution 7.1. Since f(x,y) here is non-negative for all x and y and since the total $\sum_{x}\sum_{y}f(x,y)=1$, f(x,y) is a joint probability function of two random variables x and y.


Example 7.2. In Example 7.1, compute the following: (1) The probability function of x alone, which is also called the marginal probability function of x; (2) the marginal probability function of y; (3) $\Pr\{x>0,y\ge 1\}$; (4) $\Pr\{x>0,y\le -1\}$; (5) $\Pr\{x+y=2\}$.


Solution 7.2. (1) Pr{x = 0} is the total of all probabilities at points where the condition x = 0 is satisfied. This is available as the sum of the probabilities in the column corresponding to x = 0, or from the marginal sum, which is 6/10. Similarly, Pr{x = 1} = 4/10. Thus the marginal




probability function of x, namely $f_1(x)$, is available from the marginal sum as

$$f_1(x)=\begin{cases}6/10, & x=0\\ 4/10, & x=1\\ 0, & \text{elsewhere.}\end{cases}$$

(2) Similarly, the probability function of y is available from the marginal sum, given by

$$f_2(y)=\begin{cases}2/10, & y=1\\ 3/10, & y=-1\\ 5/10, & y=2\\ 0, & \text{elsewhere.}\end{cases}$$


(3)

$$\Pr\{x>0,y\ge 1\}=\Pr\{x=1,y=1\}+\Pr\{x=1,y=2\}=\frac{1}{10}+\frac{2}{10}=\frac{3}{10}.$$

(4)

$$\Pr\{x>0,y\le -1\}=\Pr\{x=1,y=-1\}=\frac{1}{10}.$$


For computing the probability for x + y, first compute the possible values x + y can take with non-zero probabilities. Possible values of x + y are 1, -1, 2, 0, 3.

$$\Pr\{x+y=1\}=\Pr\{x=0,y=1\}=\frac{1}{10}$$
$$\Pr\{x+y=-1\}=\Pr\{x=0,y=-1\}=\frac{2}{10}$$
$$\Pr\{x+y=0\}=\Pr\{x=1,y=-1\}=\frac{1}{10}$$
$$\Pr\{x+y=3\}=\Pr\{x=1,y=2\}=\frac{2}{10}$$
Similarly, for (5),
$$\Pr\{x+y=2\}=\Pr\{x=1,y=1\}+\Pr\{x=0,y=2\}=\frac{1}{10}+\frac{3}{10}=\frac{4}{10}.$$
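The computations in Examples 7.1 and 7.2 can be reproduced with exact fractions. The following is a minimal sketch: the dictionary layout and the helper names `marginal` and `prob` are illustrative choices, and event (3) is read as x > 0, y ≥ 1, which is the reading that agrees with the value 3/10 obtained in the solution.

```python
from fractions import Fraction as F

# Joint probability function of Example 7.1, keyed by (x, y).
f = {
    (0, 1): F(1, 10), (1, 1): F(1, 10),
    (0, -1): F(2, 10), (1, -1): F(1, 10),
    (0, 2): F(3, 10), (1, 2): F(2, 10),
}

def marginal(joint, axis):
    """Sum out the other variable; axis 0 gives f1(x), axis 1 gives f2(y)."""
    out = {}
    for point, p in joint.items():
        out[point[axis]] = out.get(point[axis], 0) + p
    return out

def prob(joint, event):
    """Total probability of the points (x, y) for which event(x, y) is true."""
    return sum(p for (x, y), p in joint.items() if event(x, y))

total = sum(f.values())                          # must be 1
f1 = marginal(f, 0)                              # marginal of x
f2 = marginal(f, 1)                              # marginal of y
p3 = prob(f, lambda x, y: x > 0 and y >= 1)      # part (3)
p4 = prob(f, lambda x, y: x > 0 and y <= -1)     # part (4)
p5 = prob(f, lambda x, y: x + y == 2)            # part (5)
print(total, f1, f2, p3, p4, p5)
```

Using `Fraction` keeps every probability exact, so the totals can be compared with equality rather than with a tolerance.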


Example 7.3. For the probability function in Example 7.1 compute the following: (1) Graph the function f(x,y); (2) What is the probability function of x given that y = -1?


Solution 7.3. (1) It is a 3-dimensional graph. 




[Figure 7.1: Joint probability function.]


At the points marked in Figure 7.1, there are points up at heights equal to the probabilities; that is, the z-coordinates are the probabilities.

(2) At y = -1, there are two points (y = -1, x = 0) and (y = -1, x = 1), with the respective probabilities 2/10 at x = 0 and 1/10 at x = 1, thus a total of 3/10. Hence a probability function can be created by dividing by the total.

$$f(x\text{ given }y=-1)=\begin{cases}\frac{2/10}{3/10}, & x=0\\[2pt] \frac{1/10}{3/10}, & x=1\\[2pt] 0, & \text{elsewhere}\end{cases}=\begin{cases}2/3, & x=0\\ 1/3, & x=1\\ 0, & \text{elsewhere.}\end{cases}$$
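The "divide by the total" step can be written as a small function. This is a sketch only; the name `conditional_x_given_y` is a hypothetical helper and the table is the one from Example 7.1.

```python
from fractions import Fraction as F

# Joint probability function of Example 7.1, keyed by (x, y).
f = {
    (0, 1): F(1, 10), (1, 1): F(1, 10),
    (0, -1): F(2, 10), (1, -1): F(1, 10),
    (0, 2): F(3, 10), (1, 2): F(2, 10),
}

def conditional_x_given_y(joint, b):
    """Probability function of x given y = b: joint values divided by f2(b)."""
    f2_b = sum(p for (x, y), p in joint.items() if y == b)
    if f2_b == 0:
        raise ValueError("conditioning value has zero marginal probability")
    return {x: p / f2_b for (x, y), p in joint.items() if y == b}

g = conditional_x_given_y(f, -1)
print(g)
```

The guard against a zero marginal mirrors the requirement that the marginal probability at the conditioning value must be non-zero.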


Example 7.4. Evaluate c so that f(x,y), given below, is a density function and then compute the marginal density functions of x and y.

$$f(x,y)=\begin{cases}c(x+y), & 0\le x\le 1,\ 0\le y\le 1\\ 0, & \text{elsewhere.}\end{cases}$$


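Solution 7.4. Since f(x,y) is defined on a continuum of points in the square $\{(x,y)\mid 0\le x\le 1, 0\le y\le 1\}$ it is a non-negative function if $c>0$. The total integral is given by

$$\int_{x=0}^{1}\int_{y=0}^{1}c(x+y)\,dx\wedge dy=c\int_{x=0}^{1}\left[\int_{y=0}^{1}(x+y)\,dy\right]dx=c\int_{x=0}^{1}\left(x+\frac{1}{2}\right)dx=c\left[\frac{x^{2}}{2}+\frac{x}{2}\right]_{0}^{1}=c.$$

Equating the total integral to 1 gives c = 1.

(2) Density of x alone is available by integrating out y from the joint density, which is available from the above steps. That is, the marginal density of x is given by

$$f_1(x)=\begin{cases}x+\frac{1}{2}, & 0\le x\le 1\\ 0, & \text{elsewhere.}\end{cases}$$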




From symmetry, it follows that the marginal density of y is given by

$$f_2(y)=\begin{cases}y+\frac{1}{2}, & 0\le y\le 1\\ 0, & \text{elsewhere.}\end{cases}$$
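A quick numerical check of Example 7.4, assuming c = 1 as found in the solution. The midpoint-rule helper `integrate` and the panel count are illustrative assumptions; for a linear integrand such as x + y the midpoint rule is exact up to rounding.

```python
def integrate(f, a, b, n=400):
    """Composite midpoint rule on [a, b] with n equal panels."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def joint(x, y, c=1.0):
    """f(x, y) = c (x + y) on the unit square, 0 elsewhere."""
    return c * (x + y) if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0 else 0.0

def marginal_x(x):
    """Integrate y out of the joint density."""
    return integrate(lambda y: joint(x, y), 0.0, 1.0)

total = integrate(marginal_x, 0.0, 1.0)   # total integral, expected to be 1
value = marginal_x(0.3)                   # expected to be 0.3 + 1/2
print(total, value)
```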


Exercises 7.1 


7.1.1. Check whether the following are probability functions: 


ς 1 2 
(1) f( 11-30 f( απ. ft 1307 p 
2 3 8 
OD. f(0,2)= 75. F(0,3) = 55 and 


f(x, y) = 0, elsewhere.
(2) f(-2, 0) = 1 and
f(x, y) = 0, elsewhere.


(3) [0.1 - /5.2)--5, fO.1= 2. 


and 


f(0,2) = 


allw τί - 


f(x, y) = 0, elsewhere.


7.1.2. Check whether the following are density functions:


e 3-23 " 
(1) foy-4 Vm ^" -œ <y <œ, 0 xx«1, 

0, elsewhere. 
(2 f(xy) 2630 23. -co«y«oo, 1€x«oo, c-constant 

x,y) = 

0, elsewhere. 

(3) fœy)= que Ie», -co «y < oo, O x X « co 
| > elsewhere. 




71.3. If f(x,y) = ce BIW"), Loo < x < oo, —co < y < co is a density then find the 


conditions on a, f, y and then evaluate c. 


7.1.4. Can the following function be a joint density function (give reasons):


67, 0xx«oo, 0xy«oo 
f(%y) = 
O, elsewhere. 


7.1.5. Can the following be a density function (give reasons): 
fogy) = cete 09 < X «oo, -00 «y < oo 


where c is a positive constant. 




7.1.6. For Exercise 7.1.1(1) compute the following: (1) Pr{x = -1, y = 1}; (2) Pr{x < 0, y > 2}; (3) Pr{x < -2, y = 0}.


7.1.7. For Exercise 7.1.1(2) compute the following probabilities: (1) Pr{x = -2}; (2) Pr{y = 0}; (3) Pr{x < -2}; (4) Pr{x = -2, y > 0}.


7.1.8. A balanced die is rolled 3 times. Let $x_1$ be the number of times 1 appears and $x_2$ be the number of times 2 appears. Work out the joint probability function of $x_1$ and $x_2$.


7.1.9. A box contains 10 red, 12 green and 14 white identical marbles. Marbles are picked at random, one by one, without replacement. 8 such marbles are picked. Let $x_1$ be the number of red marbles and $x_2$ be the number of green marbles obtained out of these 8 marbles. Construct the joint probability function of $x_1$ and $x_2$.


7.1.10. In Exercise 71.9, compute the probability Pr{x, 2 8,x; = 5). 


7.2 Marginal and conditional probability/density functions 


Definition 7.2 (Marginal Probability/Density Functions). If we have $f(x_1,\ldots,x_k)$ as a joint probability/density function of the random variables $x_1,\ldots,x_k$, then the joint marginal probability/density function of any subset of $x_1,\ldots,x_k$, for example, $x_1,\ldots,x_r$, $r<k$, is available by summing up/integrating out the other variables. The marginal probability/density of $x_1,\ldots,x_r$, denoted by $f_{1,\ldots,r}(x_1,\ldots,x_r)$, is given by

$$f_{1,\ldots,r}(x_1,\ldots,x_r)=\sum_{x_{r+1}}\cdots\sum_{x_k}f(x_1,\ldots,x_k)$$
when $x_{r+1},\ldots,x_k$ are discrete, and
$$f_{1,\ldots,r}(x_1,\ldots,x_r)=\int_{x_{r+1}}\cdots\int_{x_k}f(x_1,\ldots,x_k)\,dx_{r+1}\wedge\cdots\wedge dx_k$$

when $x_{r+1},\ldots,x_k$ are continuous. In the mixed cases, sum over the discrete variables and integrate over the continuous variables.


Notation 7.1. $x|y$ or $x|(y=b)$: x given y, or x when y is fixed at y = b. The notation is a vertical bar and not $x/y$ or $\frac{x}{y}$.

$g_1(x|y)$ = density of x given y;

$g_1(x|y=b)$ = density of x given y = b.


Definition 7.3. The conditional probability/density function of x, given y = b, is 
defined by the following: 




$$g_1(x|y=b)=\left.\frac{f(x,y)}{f_2(y)}\right|_{y=b}=\left.\frac{\text{joint density}}{\text{marginal density of }y}\right|_{y=b}\tag{7.1}$$
provided $f_2(b)\ne 0$. Similarly, the conditional density of y given x is given by
$$g_2(y|x=a)=\left.\frac{f(x,y)}{f_1(x)}\right|_{x=a}\tag{7.2}$$

provided $f_1(a)\ne 0$.


Example 7.5. Verify that the following f(x,y) is a density. Then evaluate (1) the marginal densities; (2) the conditional density of x given y = 0.8; (3) the conditional density of y, given x = 0.3, where

$$f(x,y)=\begin{cases}2, & 0\le x\le y\le 1\\ 0, & \text{elsewhere.}\end{cases}$$


Solution 7.5. The density is a flat plane over the triangular region as given in Figure 7.2.

The triangular region can be represented as one of the following:
$$\{(x,y)\mid 0\le x\le y\le 1\}=\{(x,y)\mid 0\le x\le y\ \&\ 0\le y\le 1\}=\{(x,y)\mid x\le y\le 1\ \&\ 0\le x\le 1\}.$$

Figure 7.2: Triangular region for a joint density.

The elementary strips of integration are shown in Figure 7.2. The marginal density of x is given by

$$f_1(x)=\int_{y}f(x,y)\,dy=\int_{y=x}^{1}2\,dy=\begin{cases}2(1-x), & 0\le x\le 1\\ 0, & \text{elsewhere.}\end{cases}$$




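The marginal density of y is given by

$$f_2(y)=\int_{x}f(x,y)\,dx=\int_{x=0}^{y}2\,dx=\begin{cases}2y, & 0\le y\le 1\\ 0, & \text{elsewhere.}\end{cases}$$
Note that
$$\int_{0}^{1}f_1(x)\,dx=\int_{0}^{1}2(1-x)\,dx=2\left[x-\frac{x^{2}}{2}\right]_{0}^{1}=1.$$
Hence

$$\int_{x}\int_{y}f(x,y)\,dx\wedge dy=1.$$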


Therefore, f(x,y) is a joint density function. Here, the ordinate f(x,y) = 2, which is bigger than 1. But the probabilities are the volumes under the surface z = f(x,y) (it is a 3-dimensional surface) and the size of the ordinate does not matter. The ordinate can be bigger or less than 1 but the volumes have to be less than or equal to 1. The total volume here, which is the total integral, is 1, and hence it is a density function.

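(2) The conditional density of x given y, $g_1(x|y)$, is then

$$g_1(x|y)=\frac{f(x,y)}{f_2(y)}$$
and
$$g_1(x|y=0.8)=\left.\frac{f(x,y)}{f_2(y)}\right|_{y=0.8}=\frac{2}{2(0.8)}=\begin{cases}\frac{1}{0.8}, & 0\le x\le 0.8\\ 0, & \text{elsewhere.}\end{cases}$$
(3) The conditional density of y given x is
$$g_2(y|x=0.3)=\left.\frac{f(x,y)}{f_1(x)}\right|_{x=0.3}=\frac{2}{2(1-0.3)}=\frac{1}{1-0.3}=\begin{cases}\frac{1}{0.7}, & 0.3\le y\le 1\\ 0, & \text{elsewhere.}\end{cases}$$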


Observe that in the conditional space x varies from 0 to 0.8, whereas in the marginal space x varies from 0 to 1. Similarly, in the conditional space y varies from 0.3 to 1, whereas in the marginal space y varies from 0 to 1.
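The conditional densities of Example 7.5 can be checked numerically. In this sketch the closed-form marginals 2(1 - x) and 2y are taken from the solution, while the helper names and the midpoint-rule integrator are illustrative assumptions.

```python
def integrate(f, a, b, n=1000):
    """Composite midpoint rule on [a, b] with n equal panels."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def joint(x, y):
    """f(x, y) = 2 on the triangle 0 <= x <= y <= 1, 0 elsewhere."""
    return 2.0 if 0.0 <= x <= y <= 1.0 else 0.0

def f1(x):
    return 2.0 * (1.0 - x) if 0.0 <= x <= 1.0 else 0.0

def f2(y):
    return 2.0 * y if 0.0 <= y <= 1.0 else 0.0

def g1(x, b):
    """Conditional density of x given y = b (requires f2(b) != 0)."""
    return joint(x, b) / f2(b)

def g2(y, a):
    """Conditional density of y given x = a (requires f1(a) != 0)."""
    return joint(a, y) / f1(a)

# Integrate each conditional density over its own (conditional) range.
total_g1 = integrate(lambda x: g1(x, 0.8), 0.0, 0.8)
total_g2 = integrate(lambda y: g2(y, 0.3), 0.3, 1.0)
print(g1(0.4, 0.8), g2(0.5, 0.3), total_g1, total_g2)
```

The integration limits make the point of the closing remark explicit: the conditional range of x is [0, 0.8], not [0, 1].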




7.2.1 Geometrical interpretations of marginal and conditional distributions 


What are the geometrical interpretations of joint density, conditional densities, 
marginal densities, etc.? 

When f(x,y) ≥ 0 for all x and y and f is continuous, it represents a surface sitting over the (x, y)-plane, something like a hill sitting on plain ground. If f(x,y) could also be negative, dipping below the (x, y)-plane, then it would be like a ship in the ocean, taking the ocean surface as the (x, y)-plane; the portion of the ship's hull dipping below the water level represents f(x,y) < 0. Our densities are non-negative for all (x, y), and hence when there are only two variables x and y the density is like a hill z = f(x,y) sitting on plain ground.

What is the geometry of the conditional density of y given x = a?

Note that x = a is a point in 1-space (line), a line in 2-space (plane) but a plane in 3-space, parallel to the (y, z)-plane. When this plane cuts the hill, the outline of the hill is traced on this plane as a curve. If the total area under this curve is made unity, then this is the conditional density of y given x = a, or $g_2(y|x=a)$. In Figure 7.3, we have
z- Lein’, -00 < X,y < 003 Z = (1+ x)e Y, 0<x<1,0<y< oo. 


[Figure 7.3: Surface cut by a plane. Axes: X, Y, Z.]


What is the geometry of marginal densities?

Suppose that you use a giant bulldozer and push all the earth and stones of the hill from both sides of the (x, z)-plane and pile them up on the (x, z)-plane like a "pappadam". If we take the total area under this pile-up as one unit, then we have the marginal density of x. A similar interpretation holds for the marginal density of y, with the pile-up on the (y, z)-plane. In the discrete case, such pile-ups are already available from the marginal totals, as seen earlier.

In higher dimensions, we cannot see or visualize the geometry but algebraic evaluations are possible. In fact, we can only see 3-dimensional objects. We cannot see zero-dimensional (point), one-dimensional (line), 2-dimensional (plane), or 4 and higher dimensional objects, although 0, 1, 2 dimensional objects can be visualized. You see a point, a line or the surface of the blackboard only because of their thickness, that is, as 3-dimensional objects.




Example 7.6. For the following function $f(x_1,x_2)$, (1) evaluate the normalizing constant; (2) the conditional density of $x_1$ given $x_2=0.7$; (3) the conditional probability for $x_1>0.2$ given that $x_2=0.7$; (4) the conditional probability for $x_1>0.2$ given that $x_2<0.7$, where

$$f(x_1,x_2)=\begin{cases}c(x_1+x_2), & 0\le x_1\le 1,\ 0\le x_2\le 1\\ 0, & \text{elsewhere.}\end{cases}$$


Solution 7.6. (1) For evaluating c, we should compute the total probability, which is the total integral, and equate to 1.

$$1=c\int_{x_2=0}^{1}\int_{x_1=0}^{1}(x_1+x_2)\,dx_1\wedge dx_2=c\int_{x_2=0}^{1}\left[\int_{x_1=0}^{1}(x_1+x_2)\,dx_1\right]dx_2$$
$$=c\int_{0}^{1}\left[\frac{x_1^{2}}{2}+x_1x_2\right]_{0}^{1}dx_2=c\int_{0}^{1}\left[\frac{1}{2}+x_2\right]dx_2=c\left[\frac{x_2}{2}+\frac{x_2^{2}}{2}\right]_{0}^{1}=c\ \Rightarrow\ c=1.$$

Hence $f(x_1,x_2)$ is a density since it is already a non-negative function.
(2) The marginal density of $x_2$ is available by integrating out $x_1$ from the joint density. That is,

$$f_2(x_2)=\int_{x_1=0}^{1}(x_1+x_2)\,dx_1=\begin{cases}x_2+\frac{1}{2}, & 0\le x_2\le 1\\ 0, & \text{elsewhere.}\end{cases}$$

Hence the conditional density of $x_1$ given $x_2$ is given by

$$g_1(x_1|x_2)=\frac{f(x_1,x_2)}{f_2(x_2)}=\frac{x_1+x_2}{x_2+\frac{1}{2}}$$

for $0\le x_1\le 1$ and for all given $x_2$. Therefore,

$$g_1(x_1|x_2=0.7)=\left.\frac{x_1+x_2}{x_2+\frac{1}{2}}\right|_{x_2=0.7}=\begin{cases}\frac{x_1+0.7}{1.2}, & 0\le x_1\le 1\\ 0, & \text{elsewhere.}\end{cases}$$

Note that the ranges of $x_1$ and $x_2$ do not depend on each other, unlike in Example 7.5, and hence in the conditional space also $x_1$ ranges over [0,1]. [The student may verify that $g_1(x_1|x_2=0.7)$, as given above, is a density function by evaluating the total probability and verifying it to be 1.]

(3) $\Pr\{x_1>0.2\mid x_2=0.7\}$ is the probability for $x_1>0.2$ in the conditional density. That is,




$$\Pr\{x_1>0.2\mid x_2=0.7\}=\int_{0.2}^{1}g_1(x_1|x_2=0.7)\,dx_1=\int_{0.2}^{1}\frac{x_1+0.7}{1.2}\,dx_1=\frac{1}{1.2}\left[\frac{x_1^{2}}{2}+0.7x_1\right]_{0.2}^{1}\approx 0.87.$$

(4) This does not come from the conditional density $g_1(x_1|x_2)$. Let A be the event that $x_1>0.2$ and B be the event that $x_2<0.7$. Then

$$A\cap B=\{(x_1,x_2)\mid 0.2<x_1\le 1,\ 0\le x_2<0.7\}.$$

These events are marked in Figure 7.4.

Figure 7.4: Events $A$, $B$, $A\cap B$.


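Probabilities of A and B can be computed either from the marginal densities or from the joint density.

$$P(A\cap B)=\int_{x_1=0.2}^{1}\int_{x_2=0}^{0.7}(x_1+x_2)\,dx_1\wedge dx_2=\int_{x_1=0.2}^{1}\left[\int_{x_2=0}^{0.7}(x_1+x_2)\,dx_2\right]dx_1$$
$$=\int_{x_1=0.2}^{1}\left[x_1x_2+\frac{x_2^{2}}{2}\right]_{0}^{0.7}dx_1=\int_{x_1=0.2}^{1}\left[0.7x_1+\frac{0.49}{2}\right]dx_1=\left[0.7\frac{x_1^{2}}{2}+\frac{0.49}{2}x_1\right]_{0.2}^{1}=0.532$$
$$P(B)=\Pr\{0\le x_2<0.7\}=\int_{x_2=0}^{0.7}\int_{x_1=0}^{1}(x_1+x_2)\,dx_1\wedge dx_2=\int_{x_2=0}^{0.7}\left(\frac{1}{2}+x_2\right)dx_2=\left[\frac{x_2}{2}+\frac{x_2^{2}}{2}\right]_{0}^{0.7}=0.595$$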


Therefore,
$$P(A|B)=\frac{P(A\cap B)}{P(B)}=\frac{0.532}{0.595}\approx 0.894.$$
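Parts (3) and (4) of Example 7.6 differ in an important way: (3) conditions on the single value $x_2=0.7$ and uses the conditional density, while (4) conditions on the event $x_2<0.7$ and uses a ratio of probabilities. The sketch below recomputes both numerically; the helper names and the midpoint-rule integrator are illustrative assumptions, and c = 1 is taken from part (1).

```python
def integrate(f, a, b, n=400):
    """Composite midpoint rule on [a, b]; exact for linear integrands."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def joint(x1, x2):
    """f(x1, x2) = x1 + x2 on the unit square (c = 1)."""
    return x1 + x2

def f2(x2):
    """Marginal density of x2 on [0, 1]."""
    return x2 + 0.5

def g1(x1, x2):
    """Conditional density of x1 given x2."""
    return joint(x1, x2) / f2(x2)

# Check that g1(. | x2 = 0.7) is a density, then part (3).
total_cond = integrate(lambda x1: g1(x1, 0.7), 0.0, 1.0)
p3 = integrate(lambda x1: g1(x1, 0.7), 0.2, 1.0)

# Part (4): P(A | B) = P(A and B) / P(B) with A = {x1 > 0.2}, B = {x2 < 0.7}.
p_a_and_b = integrate(
    lambda x1: integrate(lambda x2: joint(x1, x2), 0.0, 0.7), 0.2, 1.0
)
p_b = integrate(f2, 0.0, 0.7)
p4 = p_a_and_b / p_b
print(total_cond, p3, p_a_and_b, p_b, p4)
```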
From the definition of conditional probability/density function, we may note one interesting property:




$$g_1(x|y)=\frac{f(x,y)}{f_2(y)},\quad f_2(y)\ne 0.$$

This shows that we can always have a decomposition of the joint density as a product of conditional and marginal densities:

$$f(x,y)=g_1(x|y)f_2(y),\ f_2(y)\ne 0$$
$$\phantom{f(x,y)}=g_2(y|x)f_1(x),\ f_1(x)\ne 0\tag{7.3}$$

where $f_1(x)$ and $f_2(y)$ are the marginal densities and $g_1(x|y)$ and $g_2(y|x)$ are the conditional densities. If there are k variables $x_1,\ldots,x_k$ and if we look at the conditional joint density of $x_1,\ldots,x_r$ given $x_{r+1},\ldots,x_k$, then also we have the decomposition

$$f(x_1,\ldots,x_k)=g(x_1,\ldots,x_r|x_{r+1},\ldots,x_k)f_2(x_{r+1},\ldots,x_k)\tag{7.4}$$

where $f_2(x_{r+1},\ldots,x_k)\ne 0$. Another observation one can make from (7.3) and (7.4) is that if the conditional probability/density functions do not depend on the conditions, that is, if they are free of y and x so that $g_1(x|y)=f_1(x)$ = the marginal probability/density of x itself and $g_2(y|x)=f_2(y)$ = the marginal probability/density of y itself, then we have the product probability property

$$f(x,y)=f_1(x)f_2(y)$$

and this is also called statistical independence of the random variables x and y.
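For a discrete joint probability function, the product probability property can be tested point by point. The sketch below does this for the table of Example 7.1, which turns out not to satisfy it, and for the product of its own marginals, which does by construction. The helper names `marginal` and `is_independent` are hypothetical.

```python
from fractions import Fraction as F

# Joint probability function of Example 7.1, keyed by (x, y).
f = {
    (0, 1): F(1, 10), (1, 1): F(1, 10),
    (0, -1): F(2, 10), (1, -1): F(1, 10),
    (0, 2): F(3, 10), (1, 2): F(2, 10),
}

def marginal(joint, axis):
    """Sum out the other variable; axis 0 gives f1(x), axis 1 gives f2(y)."""
    out = {}
    for point, p in joint.items():
        out[point[axis]] = out.get(point[axis], 0) + p
    return out

def is_independent(joint):
    """True when f(x, y) = f1(x) f2(y) at every point of the product support."""
    f1, f2 = marginal(joint, 0), marginal(joint, 1)
    return all(joint.get((x, y), 0) == f1[x] * f2[y] for x in f1 for y in f2)

# A joint function built as the product of the marginals of f.
f1, f2 = marginal(f, 0), marginal(f, 1)
product = {(x, y): px * py for x, px in f1.items() for y, py in f2.items()}

independent_71 = is_independent(f)
independent_product = is_independent(product)
print(independent_71, independent_product)
```

For instance, f(0, 1) = 1/10 whereas f1(0) f2(1) = (6/10)(2/10) = 12/100, so the variables in Example 7.1 are not independently distributed.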


Exercises 7.2 


7.2.1. Compute (1) marginal probability functions; conditional probability functions 
of (2) x given y = 1; (3) y given x = -1, in Exercise 7.1.1 (1). 


7.2.2. Compute the marginal and conditional probability functions, for all values of 
the conditioned variables, in Exercise 7.1.1 (2). 


7.2.3. Construct the conditional density of y given x and the marginal density of x in 
Exercise 7.1.2 (1). What is the marginal density of y here? 


7.2.4. Repeat Exercise 7.2.3 for the function in Exercise 7.1.2 (2), if possible. 


7.2.5. Repeat Exercise 7.2.3 for the function in Exercise 7.1.2 (3), if possible.


7.3 Statistical independence of random variables 


Definition 7.4 (Statistical independence). Let $f(x_1,\ldots,x_k)$ be the joint probability/density function of the real random variables $x_1,\ldots,x_k$ and let $f_1(x_1),f_2(x_2),\ldots,f_k(x_k)$ be the marginal probability/density functions of the individual variables. Let them satisfy the condition




$$f(x_1,\ldots,x_k)=f_1(x_1)\cdots f_k(x_k)\tag{7.5}$$

for all $x_1,\ldots,x_k$ for which $f_i(x_i)\ne 0$, $i=1,\ldots,k$. Then $x_1,\ldots,x_k$ are said to be statistically independently distributed, or $x_1,\ldots,x_k$ are said to satisfy the product probability property.


Because of the term "independent" this concept is often misused in applications. The variables depend on each other heavily through the product probability property (7.5). The word "independent" is used in the sense that if we look at the conditional probability/density function then the function does not depend on the conditions imposed. In this sense it is "independent of the conditions", that is, it does not depend on the observation made. Variables may not satisfy the property in (7.5) individually but in two sets or groups the variables may have the property in (7.5). That is, suppose that

$$f(x_1,\ldots,x_k)=f_1(x_1,\ldots,x_r)f_2(x_{r+1},\ldots,x_k)\tag{7.6}$$

then we say that the two sets $\{x_1,\ldots,x_r\}$ and $\{x_{r+1},\ldots,x_k\}$ of variables are statistically independently distributed. If the property (7.6) holds in two such sets, that does not imply that the property (7.5) holds for the individual variables. But if (7.5) holds then, of course, (7.6) will hold; the converse is not true.


Example 7.7. If the following function

$$f(x_1,x_2,x_3)=\begin{cases}c\,e^{-2x_1-x_2-4x_3}, & 0\le x_j<\infty,\ j=1,2,3\\ 0, & \text{elsewhere}\end{cases}$$

is a density function then compute (1) c; (2) density of $x_1$ given $x_2=5$, $x_3=2$; (3) probability that $x_1\le 10$ given that $x_2=5$, $x_3=2$.


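Solution 7.7.
$$1=c\int_{x_1=0}^{\infty}\int_{x_2=0}^{\infty}\int_{x_3=0}^{\infty}e^{-2x_1-x_2-4x_3}\,dx_1\wedge dx_2\wedge dx_3=c\int_{x_1=0}^{\infty}\int_{x_2=0}^{\infty}e^{-2x_1-x_2}\left[\int_{x_3=0}^{\infty}e^{-4x_3}\,dx_3\right]dx_1\wedge dx_2.$$
But

$$\int_{0}^{\infty}e^{-4x_3}\,dx_3=\left[-\frac{1}{4}e^{-4x_3}\right]_{0}^{\infty}=\frac{1}{4}.$$

Similarly, the integral over $x_1$ gives $\frac{1}{2}$ and the integral over $x_2$ gives 1. Hence
$$1=\frac{c}{4\times 2}\ \Rightarrow\ c=8.$$
(2) Here, we want the conditional density of $x_1$ given $x_2$ and $x_3$. The joint marginal density of $x_2$ and $x_3$ is available by integrating out $x_1$ from $f(x_1,x_2,x_3)$. Denoting it by $f_{23}(x_2,x_3)$, we have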




$$f_{23}(x_2,x_3)=\int_{x_1=0}^{\infty}8e^{-2x_1-x_2-4x_3}\,dx_1=8e^{-x_2-4x_3}\int_{0}^{\infty}e^{-2x_1}\,dx_1=\begin{cases}4e^{-x_2-4x_3}, & 0\le x_j<\infty,\ j=2,3\\ 0, & \text{elsewhere.}\end{cases}$$

Hence the conditional density of $x_1$, given $x_2,x_3$, denoted by $g_1(x_1|x_2,x_3)$, is available as

$$g_1(x_1|x_2,x_3)=\left.\frac{f(x_1,x_2,x_3)}{f_{23}(x_2,x_3)}\right|_{x_2=5,x_3=2}=\left.\frac{8\exp[-2x_1-x_2-4x_3]}{4\exp[-x_2-4x_3]}\right|_{x_2=5,x_3=2}=2e^{-2x_1},\quad 0\le x_1<\infty.$$

In other words, this function is free of the conditions $x_2=5$, $x_3=2$. In fact, this conditional density of $x_1$ is the marginal density of $x_1$ itself. It is easily seen here that

$$f(x_1,x_2,x_3)=f_1(x_1)f_2(x_2)f_3(x_3),$$

the product of the marginal densities; that is, the variables $x_1,x_2,x_3$ are statistically independently distributed.
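The steps of Example 7.7 can be traced in a few lines. This sketch uses the closed forms derived above; the function names are illustrative, and the value computed for part (3) follows from integrating the conditional density $2e^{-2x_1}$ from 0 to 10, a step that is not written out in the solution.

```python
import math

RATES = (2.0, 1.0, 4.0)          # coefficients of x1, x2, x3 in the exponent
c = math.prod(RATES)             # each integral gives 1/rate, so c = 2 * 1 * 4

def joint(x1, x2, x3):
    """f(x1, x2, x3) = c exp(-2 x1 - x2 - 4 x3) on the positive octant."""
    if min(x1, x2, x3) < 0:
        return 0.0
    return c * math.exp(-(2 * x1 + x2 + 4 * x3))

def f23(x2, x3):
    """Joint marginal density of (x2, x3), obtained by integrating out x1."""
    return 4.0 * math.exp(-(x2 + 4 * x3))

def g1(x1, x2, x3):
    """Conditional density of x1 given x2 and x3."""
    return joint(x1, x2, x3) / f23(x2, x3)

# Part (3): Pr{x1 <= 10 | x2 = 5, x3 = 2} = integral of 2 exp(-2 x1) over [0, 10].
p3 = 1.0 - math.exp(-2.0 * 10.0)
print(c, g1(0.3, 5.0, 2.0), g1(0.3, 1.0, 7.0), p3)
```

Evaluating `g1` at two different conditioning points gives the same value, which is the numerical face of the statement that the conditional density is free of the conditions.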


Statistical independence can also be defined in terms of joint distribution func- 
tion, joint moment generating function, etc. We have defined the concept assuming 
the existence of the probability/density functions. In some cases, the moment gener- 
ating function may be defined but the density may not exist. Lévy distribution and 
singular normal are examples. 


Definition 7.5 (Joint distribution function). The cumulative probability/density function in $x_1,\ldots,x_k$ is denoted by $F_{x_1,\ldots,x_k}(a_1,\ldots,a_k)$ and it is the following:

$$F_{x_1,\ldots,x_k}(a_1,\ldots,a_k)=\Pr\{x_1\le a_1,\ldots,x_k\le a_k\}\tag{7.7}$$

for all real values of $a_1,\ldots,a_k$.


If the right side of (7.7) can be evaluated, then we say that we have the joint distribution function for the random variables $x_1,\ldots,x_k$. In terms of probability/density functions

$$F_{x_1,\ldots,x_k}(a_1,\ldots,a_k)=\sum_{-\infty<x_1\le a_1}\cdots\sum_{-\infty<x_k\le a_k}f(x_1,\ldots,x_k)$$

when $x_1,\ldots,x_k$ are discrete, and it is

$$F_{x_1,\ldots,x_k}(a_1,\ldots,a_k)=\int_{-\infty}^{a_1}\cdots\int_{-\infty}^{a_k}f(x_1,\ldots,x_k)\,dx_1\wedge\cdots\wedge dx_k\tag{7.8}$$

when $x_1,\ldots,x_k$ are continuous. For the mixed cases, sum up over discrete variables and integrate over the continuous variables.




Note that when the random variables $x_1,\ldots,x_k$ are independently distributed, that is, when the joint density is the product of the marginal densities, the joint distribution function factorizes into the product of the individual distribution functions. That is,

$$F_{x_1,\ldots,x_k}(a_1,\ldots,a_k) = F_{x_1}(a_1)\cdots F_{x_k}(a_k) \tag{7.9}$$

where $F_{x_i}(a_i)$ is the distribution function or cumulative probability/density function of $x_i$, evaluated at $x_i = a_i$, $i = 1,\ldots,k$. One can also use (7.9) as the definition of statistical independence: the variables are independently distributed when the joint distribution function is the product of the individual distribution functions. Then from (7.9), (7.5) will follow. Example 7.7 is a case where the variables are independently distributed.
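The factorization in (7.9) can be checked numerically in a small discrete case. The following is a minimal sketch: the two marginal probability tables and the helper names `cdf` and `joint_cdf` are hypothetical, chosen only for illustration. The joint probability function is built as the product of the marginals, and the joint distribution function is then compared with the product of the individual distribution functions on a grid of points.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical marginal probability functions (illustrative values only).
f1 = {0: F(1, 4), 1: F(3, 4)}
f2 = {-1: F(1, 3), 0: F(1, 3), 2: F(1, 3)}

# Independence: the joint probability function is the product of the marginals.
joint = {(x1, x2): p1 * p2
         for (x1, p1), (x2, p2) in product(f1.items(), f2.items())}

def cdf(pf, a):
    """Distribution function of one discrete variable: Pr{x <= a}."""
    return sum(p for x, p in pf.items() if x <= a)

def joint_cdf(pf, a1, a2):
    """Joint distribution function: Pr{x1 <= a1, x2 <= a2}."""
    return sum(p for (x1, x2), p in pf.items() if x1 <= a1 and x2 <= a2)

# Compare both sides of the factorization on a grid of points (a1, a2).
grid = [-2, -1, 0, 0.5, 1, 2, 3]
factorizes = all(
    joint_cdf(joint, a1, a2) == cdf(f1, a1) * cdf(f2, a2)
    for a1 in grid for a2 in grid
)
print(factorizes)
```

Exact fractions are used so that the comparison is not affected by rounding.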


Exercises 7.3 


7.3.1. If $f(x_1,x_2,x_3) = c(x_1 + x_2 + x_3)$, $0 \le x_i < 1$, $i = 1,2,3$ and $f(x_1,x_2,x_3) = 0$ elsewhere is a density function then (1): evaluate $c$; (2): show that the variables here are not independently distributed.


7.3.2. If the following is the non-zero part of a joint density then evaluate (1) the normalizing constant $c$ and list the conditions needed on the parameters; (2) the marginal densities of $x_1$ and $x_2$; (3) show that $x_1$ and $x_2$ are not independently distributed.


f 04,35) = ext x (1 -- χι x5) 
for $0 < x_1 < 1$, $0 < x_2 < 1$, $x_1 + x_2 < 1$.

7.3.3. Do the same as in Exercise 7.3.2 if the function is the following:
FX) = exp ad? (1 x, + x5) (tte) 
for $0 < x_1 < \infty$, $0 < x_2 < \infty$.

7.3.4. Do the same as in Exercise 7.3.2 if the function is the following:
fX) Od (x xS) ox -1)571 
for $0 \le x_1 \le 1$, $0 \le x_2 \le 1$, $x_1 + x_2 \le 1$.

7.3.5. Do the same as in Exercise 7.3.2 if the function is the following:
f 0433) = exp Ga x) e xy x3) 


for $0 < x_1 < \infty$, $0 < x_2 < \infty$.


7.4 Expected value 


The definition of expected values in the joint distribution is also parallel to that in the one variable case. Let $f(x_1,\ldots,x_k)$ be the joint probability/density function of real random variables $x_1,\ldots,x_k$. Let $\psi(x_1,\ldots,x_k)$ be a function of $x_1,\ldots,x_k$.




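Notation 7.2. $E[\psi(x_1,\ldots,x_k)]$: expected value of $\psi(x_1,\ldots,x_k)$.

Definition 7.6 (Expected value).

$$E[\psi(x_1,\ldots,x_k)] = \sum_{-\infty<x_1<\infty}\cdots\sum_{-\infty<x_k<\infty} \psi(x_1,\ldots,x_k)\,f(x_1,\ldots,x_k)$$

if $x_1,\ldots,x_k$ are discrete;

$$E[\psi(x_1,\ldots,x_k)] = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} \psi(x_1,\ldots,x_k)\,f(x_1,\ldots,x_k)\,dx_1\wedge\cdots\wedge dx_k$$

if $x_1,\ldots,x_k$ are continuous. For the mixed cases, integrate over the continuous variables and sum up over the discrete variables.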


Example 7.8. Compute (1) $E(xy)$; (2) $E\{[x - E(x)][y - E(y)]\}$; (3) $E[x]$ for the following probability function:

$$f(0,-1) = \frac{1}{5},\quad f(0,1) = \frac{2}{5},\quad f(1,-1) = \frac{1}{5},\quad f(1,1) = \frac{1}{5},$$

and $f(x,y) = 0$ elsewhere.


Solution 7.8. Since both the variables $x$ and $y$ here are discrete we will sum up.

(1) Expected value of $xy$ means: take the values that $x$ can take with non-zero probabilities and the corresponding $y$ values, multiply them together, then multiply by the corresponding probabilities and add up all such products. For example, when $x$ takes the value $0$ and $y$ takes the value $-1$ the corresponding probability is $\frac{1}{5}$. Hence this term is $(0)(-1)(\frac{1}{5}) = 0$. That is,

$$E(xy) = (0)(-1)\Big(\frac{1}{5}\Big) + (0)(1)\Big(\frac{2}{5}\Big) + (1)(-1)\Big(\frac{1}{5}\Big) + (1)(1)\Big(\frac{1}{5}\Big) = 0.$$

Thus $E[xy]$ in this example is zero.

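(2) For computing the second expected value, we need $E[x]$ and $E[y]$. We can compute these from either the marginal probability functions or from the joint probability function. The marginal probability function of $x$ is given by

$$f_1(x) = \sum_y f(x,y) = \begin{cases} 3/5, & x = 0\\ 2/5, & x = 1\\ 0, & \text{elsewhere.}\end{cases}$$

Therefore,

$$E(x) = (0)\Big(\frac{3}{5}\Big) + (1)\Big(\frac{2}{5}\Big) = \frac{2}{5}.$$

If this is to be computed by using the joint probability function, then take one $x$ value, add up all the corresponding probabilities, multiply by that $x$ value, and then add up over all $x$ values. That is,

$$E[x] = \sum_x\sum_y x f(x,y) = \sum_x x\Big[\sum_y f(x,y)\Big] = (0)\Big(\frac{1}{5}+\frac{2}{5}\Big) + (1)\Big(\frac{1}{5}+\frac{1}{5}\Big) = \frac{2}{5}.$$

Similarly, $E[y] = (-1)(\frac{2}{5}) + (1)(\frac{3}{5}) = \frac{1}{5}$. Now, we can compute

$$E\{[x - E(x)][y - E(y)]\} = E\Big\{\Big[x - \frac{2}{5}\Big]\Big[y - \frac{1}{5}\Big]\Big\}.$$

Now put a value $x$ takes, put the corresponding $y$ value, multiply by the corresponding probability and add up all such quantities:

$$\Big[0-\frac{2}{5}\Big]\Big[-1-\frac{1}{5}\Big]\frac{1}{5} + \Big[0-\frac{2}{5}\Big]\Big[1-\frac{1}{5}\Big]\frac{2}{5} + \Big[1-\frac{2}{5}\Big]\Big[-1-\frac{1}{5}\Big]\frac{1}{5} + \Big[1-\frac{2}{5}\Big]\Big[1-\frac{1}{5}\Big]\frac{1}{5} = \frac{12-16-18+12}{125} = -\frac{2}{25},$$

which is also equal to

$$E[xy] - E(x)E(y) = 0 - \Big(\frac{2}{5}\Big)\Big(\frac{1}{5}\Big) = -\frac{2}{25}$$

in this case. In fact, this is a general result.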
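The arithmetic of Example 7.8 can be reproduced with exact fractions. The sketch below assumes the probability function reads $f(0,-1)=1/5$, $f(0,1)=2/5$, $f(1,-1)=1/5$, $f(1,1)=1/5$; the helper name `expect` is hypothetical. It evaluates the covariance both from the definition and from the shortcut $E(xy) - E(x)E(y)$.

```python
from fractions import Fraction as F

# Joint probability function of Example 7.8 (values as read from the example).
f = {(0, -1): F(1, 5), (0, 1): F(2, 5), (1, -1): F(1, 5), (1, 1): F(1, 5)}

def expect(psi):
    """E[psi(x, y)]: sum of psi(x, y) * f(x, y) over the support."""
    return sum(psi(x, y) * p for (x, y), p in f.items())

ex = expect(lambda x, y: x)
ey = expect(lambda x, y: y)
exy = expect(lambda x, y: x * y)

cov_def = expect(lambda x, y: (x - ex) * (y - ey))  # E{[x - E(x)][y - E(y)]}
cov_short = exy - ex * ey                           # E(xy) - E(x)E(y)

print(exy, ex, ey, cov_def, cov_short)  # 0 2/5 1/5 -2/25 -2/25
```

Both routes give $-2/25$, in agreement with the hand computation.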


Notation 7.3. $\mathrm{Cov}(x,y)$: covariance between $x$ and $y$.

Definition 7.7 (Covariance). The covariance between two real scalar random variables $x$ and $y$ is defined as

$$\mathrm{Cov}(x,y) = E\{[x - E(x)][y - E(y)]\} = E[xy] - E(x)E(y). \tag{7.10}$$

The equivalence can be seen by observing the following: once the expected value is taken it is a constant (does not depend on the variables any more); the expected value of a constant times a function is the constant times the expected value; the expected value of a sum is the sum of the expected values as long as the expected values exist. These properties, analogous to the corresponding properties in the one variable case, will follow from the definition itself. Opening up

$$[x - \mu_1][y - \mu_2] = xy - \mu_1 y - \mu_2 x + \mu_1\mu_2$$

where $\mu_1 = E(x)$ and $\mu_2 = E(y)$ are constants. Now taking the expected values we have




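$$\begin{aligned}
\mathrm{Cov}(x,y) &= E\{[x - E(x)][y - E(y)]\}\\
&= E\{[x - \mu_1][y - \mu_2]\} = E\{xy - \mu_1 y - \mu_2 x + \mu_1\mu_2\}\\
&= E(xy) - \mu_1 E(y) - \mu_2 E(x) + \mu_1\mu_2\\
&= E(xy) - \mu_1\mu_2 - \mu_2\mu_1 + \mu_1\mu_2\\
&= E(xy) - \mu_1\mu_2 = E(xy) - E(x)E(y).
\end{aligned}$$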


This result holds when both x and y are continuous, both are discrete, one is continu- 
ous and the other is discrete, whenever the expected values exist. 


What is this measure of covariance? We have seen that in the one variable case, standard deviation can be interpreted as a “measure of scatter” or “dispersion” in the single variable $x$, and the square of the standard deviation is the variance. Analogous to variance, covariance measures the scatter or joint dispersion or angular dispersion or joint scatter in the point $(x,y)$ or the joint variation of the coordinates $x$ and $y$, so that when $x = y$ the covariance becomes the variance in $x$. For example, if $(x,y)$ takes the values $(x_1,y_1),\ldots,(x_n,y_n)$ with probability $\frac{1}{n}$ at each point then $E(x) = \bar{x}$, $E(y) = \bar{y}$ and $\mathrm{Cov}(x,y) = \sum_{j=1}^n (x_j - \bar{x})(y_j - \bar{y})/n$. If $\theta$ is the angle between the two vectors of deviations $(x_1-\bar{x},\ldots,x_n-\bar{x})$ and $(y_1-\bar{y},\ldots,y_n-\bar{y})$, then

$$\cos\theta = \frac{\mathrm{Cov}(x,y)}{\sqrt{\mathrm{Var}(x)}\sqrt{\mathrm{Var}(y)}}. \tag{7.11}$$

Thus, covariance measures the angular dispersion between the random variables $x$ and $y$.


Example 7.9. For the following continuous case evaluate (1) $\mathrm{Cov}(x,y)$; (2) $E(x^2 + xy - y^2)$:

$$f(x,y) = \begin{cases} 2, & 0 \le x \le y \le 1\\ 0, & \text{elsewhere.}\end{cases}$$

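Solution 7.9. This density function was handled before and the marginal densities were already evaluated. For computing $E(x^2)$ and $E(y^2)$, we will use the marginal densities, which are

$$f_1(x) = 2(1 - x),\quad 0 \le x \le 1$$

and zero elsewhere, and

$$f_2(y) = 2y,\quad 0 \le y \le 1$$

and zero elsewhere. Hence

$$E(x^2) = \int_x x^2 f_1(x)\,dx = 2\int_0^1 x^2(1 - x)\,dx = 2\Big[\frac{1}{3} - \frac{1}{4}\Big] = \frac{1}{6}$$

and

$$E(y^2) = \int_y y^2 f_2(y)\,dy = \int_0^1 y^2(2y)\,dy = 2\Big[\frac{y^4}{4}\Big]_0^1 = \frac{1}{2}.$$

For computing $E(xy)$, we have to use double integrals. Either we can integrate out $y$ first over $x \le y \le 1$ and then $x$ over $0 \le x \le 1$, or integrate out $x$ first over $0 \le x \le y$ and then $y$ over $0 \le y \le 1$. Therefore,

$$E(xy) = \int_{y=0}^{1} y\Big[\int_{x=0}^{y} 2x\,dx\Big]dy = \int_0^1 y\,[y^2]\,dy = \Big[\frac{y^4}{4}\Big]_0^1 = \frac{1}{4}.$$

Now we need $E(x)$ and $E(y)$.

$$E(x) = \int_x x f_1(x)\,dx = \int_0^1 (x)\,2(1 - x)\,dx = \frac{1}{3}$$

and

$$E(y) = \int_y y f_2(y)\,dy = \int_0^1 y(2y)\,dy = \frac{2}{3}.$$

Therefore, (1)

$$\mathrm{Cov}(x,y) = E(xy) - E(x)E(y) = \frac{1}{4} - \Big(\frac{1}{3}\Big)\Big(\frac{2}{3}\Big) = \frac{1}{36}$$

and (2)

$$E[x^2 + xy - y^2] = E(x^2) + E(xy) - E(y^2) = \frac{1}{6} + \frac{1}{4} - \frac{1}{2} = -\frac{1}{12}.$$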
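The integrals of Example 7.9 can be cross-checked numerically. The sketch below is not the method of the text: it assumes the density is $f(x,y) = 2$ on $0 \le x \le y \le 1$ and approximates each expected value with a midpoint rule on a grid, giving half weight to the cells cut by the line $x = y$. The helper name `expect` and the grid size are arbitrary choices.

```python
def expect(psi, n=200):
    """Approximate E[psi(x, y)] for f(x, y) = 2 on 0 <= x <= y <= 1.

    Midpoint rule on an n-by-n grid of the unit square. Cells on the
    diagonal are cut in half by the line x = y, so they get half weight.
    """
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        for j in range(i, n):              # only cells with y >= x
            y = (j + 0.5) * h
            weight = 0.5 if i == j else 1.0
            total += weight * psi(x, y) * 2.0 * h * h
    return total

e_x = expect(lambda x, y: x)
e_y = expect(lambda x, y: y)
e_xy = expect(lambda x, y: x * y)
cov = e_xy - e_x * e_y
e_q = expect(lambda x, y: x * x + x * y - y * y)

print(cov, 1 / 36)     # numerical value next to the exact covariance
print(e_q, -1 / 12)    # numerical value next to the exact E[x^2 + xy - y^2]
```

The numerical values should agree with $1/36$ and $-1/12$ to several decimal places.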


7.4.1 Some properties of expected values 


Some properties parallel to the ones in the one variable case are the following. These follow from the definition itself.

Result 7.1.

$$E(c) = c$$

where $c$ is a constant with respect to the variables $x_1,\ldots,x_k$ for which the joint distribution is used to compute the expected values.




Result 7.2.

$$E[c\,\psi(x_1,\ldots,x_k)] = c\,E[\psi(x_1,\ldots,x_k)]$$

whenever the expected value exists, where $c$ is a constant.

Result 7.3.

$$E[a\,\psi_1(x_1,\ldots,x_k) + b\,\psi_2(x_1,\ldots,x_k)] = a\,E[\psi_1(x_1,\ldots,x_k)] + b\,E[\psi_2(x_1,\ldots,x_k)]$$

whenever the expected values exist, where $a$ and $b$ are constants, and $\psi_1$ and $\psi_2$ are two functions of $x_1,\ldots,x_k$.

[Here, we have a linear combination of two functions. The result holds for a linear combination of any finite number of such functions but need not hold for an infinite sum.]


Result 7.4. Let $\psi_1(x_1), \psi_2(x_2), \ldots, \psi_k(x_k)$ be functions of $x_1,\ldots,x_k$ alone, respectively. Then the expected value of a product of a finite number of such factors is the product of the expected values, when the variables are independently distributed. That is,

$$E[\psi_1(x_1)\psi_2(x_2)\cdots\psi_k(x_k)] = E[\psi_1(x_1)]\,E[\psi_2(x_2)]\cdots E[\psi_k(x_k)]$$

whenever $x_1,\ldots,x_k$ are independently distributed.

The proof is trivial because when the variables are independently distributed the product probability property will hold and the joint probability/density function will factorize into the product of the individual probability/density functions; then the sum or integral will apply to each factor.

One consequence of this result is that if the variables are independently distributed then the covariance between them is zero, but the converse need not be true. Covariance being zero does not imply that the variables are independently distributed.


Result 7.5. When $x$ and $y$ are independently distributed, $\mathrm{Cov}(x,y) = 0$, but the converse need not be true. That is,

$$\mathrm{Cov}(x,y) = 0$$

when $x$ and $y$ are independently distributed.

The proof is trivial. When the variables are independently distributed the expected value of a product is the product of the expected values. Therefore,

$$E[(x - E(x))(y - E(y))] = E[x - E(x)]\,E[y - E(y)].$$

But for any random variable $E[x - E(x)] = E(x) - E(x) = 0$, and hence the result.




Example 7.10. For the following joint probability function, show that $\mathrm{Cov}(x,y) = 0$ but the variables are not independently distributed.

$$f(0,-1) = \frac{3}{10},\quad f(0,1) = \frac{2}{10},\quad f(1,-1) = \frac{1}{10},\quad f(1,1) = \frac{3}{10},\quad f(2,-1) = \frac{1}{10},\quad f(2,1) = 0,$$

and $f(x,y) = 0$ elsewhere.

Solution 7.10. The marginal probability function of $y$ is given by the marginal sum, that is,

$$f_2(y) = \begin{cases} 1/2, & y = -1\\ 1/2, & y = 1\\ 0, & \text{elsewhere.}\end{cases}$$

Hence $E[y] = (-1)\frac{1}{2} + (1)\frac{1}{2} = 0$. Also

$$E(xy) = (0)(-1)\Big(\frac{3}{10}\Big) + (0)(1)\Big(\frac{2}{10}\Big) + (1)(-1)\Big(\frac{1}{10}\Big) + (1)(1)\Big(\frac{3}{10}\Big) + (2)(-1)\Big(\frac{1}{10}\Big) + (2)(1)(0) = 0.$$

Hence $\mathrm{Cov}(x,y) = E(xy) - E(x)E(y) = 0 - E(x)(0) = 0$. But, for example,

$$\Pr\{x = 1, y = -1\} = \frac{1}{10};\quad \Pr\{x = 1\} = \frac{4}{10};\quad \Pr\{y = -1\} = \frac{1}{2}.$$

Therefore, $\Pr\{x = 1, y = -1\} \ne \Pr\{x = 1\}\Pr\{y = -1\}$, and hence the variables $x$ and $y$ are not independently distributed.

Note 7.4. If the product probability property does not hold even for one point, then the variables are not independently distributed. Hence to show that two variables are not independently distributed, one has to come up with at least one point where the product probability property does not hold.
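Example 7.10 and Note 7.4 can be checked mechanically. The sketch below assumes the probabilities are $f(0,-1)=3/10$, $f(0,1)=2/10$, $f(1,-1)=1/10$, $f(1,1)=3/10$, $f(2,-1)=1/10$, $f(2,1)=0$, which is the assignment consistent with the marginal of $y$ and with $\Pr\{x=1\} = 4/10$ quoted in the solution. The helper names are hypothetical. It computes the covariance and lists the points where the product probability property fails.

```python
from fractions import Fraction as F

# Joint probability function of Example 7.10 (values as assumed above).
f = {(0, -1): F(3, 10), (0, 1): F(2, 10), (1, -1): F(1, 10),
     (1, 1): F(3, 10), (2, -1): F(1, 10), (2, 1): F(0)}

def expect(psi):
    """E[psi(x, y)] under the joint probability function f."""
    return sum(psi(x, y) * p for (x, y), p in f.items())

def marginal_x(x0):
    """Pr{x = x0}, obtained by summing out y."""
    return sum(p for (x, _), p in f.items() if x == x0)

def marginal_y(y0):
    """Pr{y = y0}, obtained by summing out x."""
    return sum(p for (_, y), p in f.items() if y == y0)

cov = expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)

# Points where f(x, y) differs from the product of the marginals.
violations = [(x, y) for (x, y), p in f.items()
              if p != marginal_x(x) * marginal_y(y)]

print(cov)          # 0
print(violations)   # one such point is already enough to rule out independence
```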


7.4.2 Joint moment generating function 


Notation 7.4. $M(T) = M(t_1,\ldots,t_k)$: joint moment generating function.

Definition 7.8 (Joint moment generating function). The joint moment generating function is defined as the expected value of a linear function in the exponent. That is,

$$M(t_1,\ldots,t_k) = E\big[e^{t_1x_1 + \cdots + t_kx_k}\big]$$

where $t_1,\ldots,t_k$ are arbitrary parameters, provided the expected value exists.


From the joint moment generating function $M(t_1,\ldots,t_k)$, the individual moment generating functions (mgf) are available by putting the other $t_j$'s equal to zero. For example, the joint mgf of the first $r$ variables, $r < k$, is available by putting $t_{r+1} = 0 = \cdots = t_k$, or it is $M(t_1,\ldots,t_r,0,\ldots,0)$. Also the joint integer moments, usually known as product moments, of the type $E[x_1^{r_1}\cdots x_k^{r_k}]$ where $r_j = 0,1,2,\ldots$, $j = 1,\ldots,k$ are available from $M(t_1,\ldots,t_k)$ by expansion, when $M(t_1,\ldots,t_k)$ admits a power series expansion, or by differentiation when $M(t_1,\ldots,t_k)$ is differentiable. That is,

$$E(x_1^{r_1}\cdots x_k^{r_k}) = \frac{\partial^{\,r_1+\cdots+r_k}}{\partial t_1^{r_1}\cdots\partial t_k^{r_k}} M(t_1,\ldots,t_k)\Big|_{t_1=0,\ldots,t_k=0} = \text{coefficient of } \frac{t_1^{r_1}\cdots t_k^{r_k}}{r_1!\cdots r_k!}$$

in the expansion of $M(t_1,\ldots,t_k)$ around $(t_1 = 0,\ldots,t_k = 0)$, for $r_j = 0,1,\ldots$, $j = 1,\ldots,k$.

Sometimes it is convenient to use the vector, matrix notation. Let the prime denote a transpose and let $T' = (t_1,\ldots,t_k)$; then we may also represent the joint moment generating function as either $M(T)$ or $M(T')$ according to convenience. One immediate consequence of statistical independence is that the joint moment generating function will factorize into individual moment generating functions. This will be stated as a result but it follows from the fact that when the variables are independently distributed the joint probability/density or joint distribution function factorizes into individual probability/density or distribution functions.


Result 7.6. When the real random variables $x_1,\ldots,x_k$ are independently distributed, then the joint moment generating function, when it exists, factorizes into the product of individual moment generating functions. That is,

$$M(t_1,\ldots,t_k) = M_{x_1}(t_1)\cdots M_{x_k}(t_k) \tag{7.12}$$

where $M_{x_j}(t_j)$, $j = 1,\ldots,k$ are the individual moment generating functions.

One may use (7.12) as the definition of statistical independence. Then the product property of probability/density functions and distribution functions can be shown to follow from (7.12). Thus one can define statistical independence through probability/density functions, distribution functions, moment generating functions, characteristic functions, etc.


Note 7.5. The joint characteristic function is available from the joint moment generating function formula by replacing $t_j$ by $it_j$, $i = \sqrt{-1}$, $j = 1,\ldots,k$. The moment generating function need not always exist but the characteristic function will always exist.

Note 7.6. The joint Laplace transform is available by replacing $t_j$ of the joint moment generating function by $-t_j$, $j = 1,\ldots,k$ for positive variables $x_j > 0$, $j = 1,\ldots,k$. The joint Mellin transform for positive variables is available as the expected value $E[x_1^{s_1-1}\cdots x_k^{s_k-1}]$, whenever it exists, when $x_1,\ldots,x_k$ are positive random variables, where $s_1,\ldots,s_k$ are complex parameters.


Example 7.11. Let $x_1,\ldots,x_k$ be independently distributed normal variables with parameters $(\mu_1,\sigma_1^2),\ldots,(\mu_k,\sigma_k^2)$. Let $a_1,\ldots,a_k$ be real constants. Find the distribution of the linear function $u = a_1x_1 + \cdots + a_kx_k$.

Solution 7.11. Let us compute the moment generating function of $u$ if it exists.

$$M_u(t) = E[e^{tu}] = E\big[e^{t(a_1x_1+\cdots+a_kx_k)}\big] = \prod_{j=1}^k M_{x_j}(a_jt)$$

due to the product probability property (PPP) or statistical independence of the variables, where $M_{x_j}(t)$ is the moment generating function of $x_j$. Since $x_j$ is normally distributed we have $a_jx_j \sim N(a_j\mu_j, a_j^2\sigma_j^2)$ and, therefore,

$$M_{x_j}(a_jt) = \exp\Big[t a_j\mu_j + \frac{t^2}{2}a_j^2\sigma_j^2\Big].$$

Hence the product becomes

$$M_u(t) = \prod_{j=1}^k \exp\Big[t a_j\mu_j + \frac{t^2}{2}a_j^2\sigma_j^2\Big] = \exp\Big[t\Big(\sum_{j=1}^k a_j\mu_j\Big) + \frac{t^2}{2}\Big(\sum_{j=1}^k a_j^2\sigma_j^2\Big)\Big]$$

which is the moment generating function of a normal variable. Therefore, $u$ is normally distributed with the parameters $\sum_{j=1}^k a_j\mu_j = E[u]$ and $\sum_{j=1}^k a_j^2\sigma_j^2 = \mathrm{Var}(u)$.
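The parameters obtained in Solution 7.11 can be compared with a simulation. This is only a sketch: the constants, means and variances below are hypothetical, and the simulation checks the mean and variance of $u$, not the normality itself. The function name `linear_normal_params` is an illustrative choice.

```python
import math
import random

def linear_normal_params(a, mu, sigma2):
    """Mean and variance of u = sum a_j x_j for independent x_j ~ N(mu_j, sigma2_j)."""
    mean = sum(aj * mj for aj, mj in zip(a, mu))
    var = sum(aj ** 2 * s2 for aj, s2 in zip(a, sigma2))
    return mean, var

# Hypothetical constants and parameters, for illustration only.
a = [2.0, -1.0, 0.5]
mu = [1.0, 3.0, -2.0]
sigma2 = [1.0, 4.0, 9.0]

mean_u, var_u = linear_normal_params(a, mu, sigma2)

# Monte Carlo check with a fixed seed so that the run is reproducible.
rng = random.Random(0)
n = 100_000
draws = [
    sum(aj * rng.gauss(mj, math.sqrt(s2)) for aj, mj, s2 in zip(a, mu, sigma2))
    for _ in range(n)
]
sample_mean = sum(draws) / n
sample_var = sum((d - sample_mean) ** 2 for d in draws) / (n - 1)

print(mean_u, var_u)            # theoretical E[u] and Var(u)
print(sample_mean, sample_var)  # simulated counterparts
```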


Definition 7.9 (Independently and identically distributed variables (iid) or a simple 
random sample). A collection x,,...,x, of real random variables, which are inde- 
pendently and identically distributed with the common probability/density func- 
tion f (x), is called a simple random sample of size n from the population designated 
by f(x). 




A population can be described by a random variable, such as “a normal population”, “an exponential population”, “a Bernoulli population”, etc., or by a probability/density function or by a distribution (cumulative probability/density) function or by a characteristic function, etc. Thus a simple random sample is a collection of random variables and not a set of numbers, as often misinterpreted. The word “simple” is used because there are also other types of random samples such as systematic samples, multistage samples, stratified samples, etc. Here, we are considering only simple random samples.

When the variables satisfy PPP or are independently distributed the joint proba- 
bility/density function (if it exists) is the product of the marginal probability/density 
functions, or the joint distribution function (our random variables are defined in terms 
of distribution functions) factorizes into product of individual distribution functions 
or the joint characteristic function factorizes into product of individual characteristic 
functions (always exist) or the joint moment generating function (if it exists) factor- 
izes into individual moment generating functions. Identically distributed means that 
the functional forms and the parameters are the same for all the probability/density 
functions, distribution functions, etc. 

As an example, if $x_1$ and $x_2$ are iid exponentially distributed variables then the joint density, denoted by $f(x_1,x_2)$, is given by

$$f(x_1,x_2) = \frac{1}{\theta^2}e^{-\frac{x_1+x_2}{\theta}}$$

for $0 \le x_1 < \infty$, $0 \le x_2 < \infty$, $\theta > 0$, and zero elsewhere. Thus the functional forms and the parameters in $x_1$ and $x_2$ are the same. Of course, the variables $x_1$ and $x_2$ are two different variables; do not put $x_1 = x_2 = x$.


Example 7.12. Let $x_1,\ldots,x_n$ be iid variables. Let $u = x_1 + \cdots + x_n$ and $v = \bar{x} = \frac{x_1+\cdots+x_n}{n}$. Then show that (1) if $x_j$ is a gamma random variable then $u$ and $v$ are gamma random variables; (2) if $x_j$ is a normal random variable then $u$ and $v$ are normal random variables, or gamma and normal variables are infinitely divisible.

Solution 7.12. Let $M(t)$ be the moment generating function of $x_j$. If the variables are iid, then all have the same moment generating function $M(t)$. Then

$$M_u(t) = \prod_{j=1}^n M_{x_j}(t)\quad\text{due to statistical independence}$$

$$= [M(t)]^n\quad\text{due to identical distribution.}$$

(1) When $x_j$ is gamma distributed with the shape parameter $\alpha$ and scale parameter $\beta$, then the moment generating function is

$$M(t) = (1 - \beta t)^{-\alpha}$$

and, therefore,

$$[M(t)]^n = (1 - \beta t)^{-n\alpha}$$

which means that $u$ is gamma distributed with parameters $n\alpha$ and $\beta$. Also note that if $y$ is a gamma random variable with the moment generating function

$$M_y(t) = (1 - \beta t)^{-\alpha}$$

then

$$M_y(t) = \big[\{M_y(t)\}^{\frac{1}{n}}\big]^n$$

where $M_y(t) = (1 - \beta t)^{-\alpha}$ and $\{M_y(t)\}^{1/n} = (1 - \beta t)^{-\alpha/n}$ are both moment generating functions of gamma variables. In other words, a given gamma variable $y$ can be looked upon as the sum of iid gamma variables. In other words, a gamma variable is infinitely divisible. From the same procedure above, note that $v$ is also gamma distributed.

(2) If $x_j \sim N(\mu, \sigma^2)$ (normally distributed) then

$$M_u(t) = [M_{x_j}(t)]^n = e^{n\left(t\mu + \frac{t^2}{2}\sigma^2\right)}$$

which means that $u$ is normally distributed with the parameters $n\mu$ and $n\sigma^2$. On the other hand,

$$M_{x_j}(t) = \big[\{M_{x_j}(t)\}^{\frac{1}{n}}\big]^n$$

where $\{M_{x_j}(t)\}^{1/n}$ is again the moment generating function of a normal variable with parameters $\frac{\mu}{n}$ and $\frac{\sigma^2}{n}$. In other words, a normal variable can be decomposed into a sum of iid normal variables and hence a normal variable is “infinitely divisible”. From the same procedure above, note that $v$ is also normally distributed.
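The moment generating function identities used in Solution 7.12 for the gamma case can be checked numerically. The sketch below assumes the form $M(t) = (1-\beta t)^{-\alpha}$ for $t < 1/\beta$; the parameter values and the function name `gamma_mgf` are hypothetical. It checks three statements: the sum $u$ has shape $n\alpha$ and scale $\beta$, the mean $v = u/n$ has shape $n\alpha$ and scale $\beta/n$ (from $M_v(t) = M_u(t/n)$), and the $n$-th root of a gamma mgf is again a gamma mgf with shape $\alpha/n$.

```python
import math

def gamma_mgf(t, alpha, beta):
    """mgf (1 - beta*t)^(-alpha) of a gamma variable, defined for t < 1/beta."""
    if beta * t >= 1:
        raise ValueError("the mgf is defined only for t < 1/beta")
    return (1.0 - beta * t) ** (-alpha)

# Hypothetical shape, scale and sample size, for illustration only.
alpha, beta, n = 2.5, 1.5, 4
ts = [-1.0, -0.3, 0.0, 0.2, 0.5]   # all satisfy beta * t < 1

# u = x_1 + ... + x_n:  [M(t)]^n is the mgf of a gamma with shape n*alpha, scale beta.
sum_ok = all(
    math.isclose(gamma_mgf(t, alpha, beta) ** n, gamma_mgf(t, n * alpha, beta))
    for t in ts
)

# v = u/n:  M_v(t) = [M(t/n)]^n is the mgf of a gamma with shape n*alpha, scale beta/n.
mean_ok = all(
    math.isclose(gamma_mgf(t / n, alpha, beta) ** n, gamma_mgf(t, n * alpha, beta / n))
    for t in ts
)

# Infinite divisibility: {M(t)}^(1/n) is the mgf of a gamma with shape alpha/n.
root_ok = all(
    math.isclose(gamma_mgf(t, alpha, beta) ** (1.0 / n), gamma_mgf(t, alpha / n, beta))
    for t in ts
)

print(sum_ok, mean_ok, root_ok)
```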


Definition 7.10 (Infinite divisibility). If a real random variable can be decomposed into the sum of iid variables belonging to the same family, then the variable is said to be infinitely divisible. Equivalently, let $\phi_x(t)$ be the characteristic function of $x$; if $[\phi_x(t)]^{\frac{1}{n}}$ is also a characteristic function belonging to the same family of functions as $\phi_x(t)$, then $x$ is said to be infinitely divisible.


The concept of infinite divisibility of random variables is very important in probability theory, stochastic processes, time series, etc.


7.4.3 Linear functions of random variables


In many applications, we need the distribution of linear functions of random variables, and hence we will consider variances of linear functions and covariance between linear functions. Let $x_1,\ldots,x_k$ be $k$ random variables with $\mathrm{Var}(x_j) = \sigma_j^2 = \sigma_{jj}$, the covariance between $x_i$ and $x_j$ denoted by $\sigma_{ij}$, and the matrix of variances and covariances by $\Sigma = (\sigma_{ij})$. Let $u = a_1x_1 + \cdots + a_mx_m$ and $v = b_1x_1 + \cdots + b_nx_n$ where $a_1,\ldots,a_m$, $b_1,\ldots,b_n$ are constants, something like $a_1 = 1$, $a_2 = -3$, $a_3 = 2$, $m = 3$ so that $u = x_1 - 3x_2 + 2x_3$; $b_1 = 1$, $b_2 = -1$, $n = 2$ so that $v = x_1 - x_2$. Before looking at the variances, let us examine some of the representations possible.

$$u = a_1x_1 + \cdots + a_mx_m = a'X = X'a$$

where a prime denotes the transpose, $a$ and $X$ are column vectors, whose transposes are

$$a' = (a_1,\ldots,a_m),\quad X' = (x_1,\ldots,x_m).$$

Also, recall the square of a sum and its various representations

$$\Big(\sum_{j=1}^k c_j\Big)^2 = (c_1 + \cdots + c_k)^2 = \sum_{i=1}^k\sum_{j=1}^k c_ic_j = \sum_{j=1}^k c_j^2 + 2\sum_{i<j} c_ic_j = \sum_{j=1}^k c_j^2 + 2\sum_{i>j} c_ic_j \tag{7.13}$$

where, for example, $\sum_{i>j}$ means the double sum over $i$ and $j$ subject to the condition $i > j$. Now let us see what are the variances of $u$ and $v$ and what is the covariance between $u$ and $v$. We use the basic definitions. By definition,


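$$\mathrm{Var}(u) = E[u - E(u)]^2 = E[a'(X - E(X))]^2.$$

Since $a'[X - E(X)]$ is a scalar quantity, it is also equal to its transpose. Therefore, we may write

$$\begin{aligned}
E[a'(X - E(X))]^2 &= E[\{a'(X - E(X))\}\{a'(X - E(X))\}']\\
&= E[\{a'(X - E(X))\}\{(X - E(X))'a\}]\\
&= a'E[(X - E(X))(X - E(X))']a = a'\Sigma a.
\end{aligned}\tag{7.14}$$

That is, the variance of a linear function is a quadratic form:

$$\mathrm{Var}(a'X) = a'\Sigma a = [a_1,\ldots,a_m]\begin{bmatrix}\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1m}\\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2m}\\ \vdots & \vdots & \ddots & \vdots\\ \sigma_{m1} & \sigma_{m2} & \cdots & \sigma_{mm}\end{bmatrix}\begin{bmatrix}a_1\\ \vdots\\ a_m\end{bmatrix}. \tag{7.15}$$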
Writing it by using (7.13) or without using matrix notation, we have the following:

$$\begin{aligned}
\mathrm{Var}(u) &= \sum_{j=1}^m a_j^2\,\mathrm{Var}(x_j) + 2\sum_{i<j} a_ia_j\,\mathrm{Cov}(x_i,x_j)\\
&= \sum_{j=1}^m a_j^2\sigma_{jj} + 2\sum_{i<j} a_ia_j\sigma_{ij}\\
&= \sum_{j=1}^m a_j^2\,\mathrm{Var}(x_j)\quad\text{when } x_1,\ldots,x_m \text{ are non-correlated;}\\
\mathrm{Var}(\bar{x}) &= \sum_{j=1}^n \frac{1}{n^2}\mathrm{Var}(x_j) = \frac{1}{n}\mathrm{Var}(x)\quad\text{when } x_1,\ldots,x_n \text{ are iid.}
\end{aligned}\tag{7.16}$$

Those who are familiar with quadratic forms may use the most convenient form in (7.14), and others may use the form in (7.16). Note that the matrix $\Sigma$ in (7.15) is at least positive semi-definite, since it is coming from the form $BB'$ for some matrix $B$.

Let $y_1,\ldots,y_n$ be another set of variables and let $v = b_1y_1 + \cdots + b_ny_n = b'Y$ with $b' = (b_1,\ldots,b_n)$ and $Y' = (y_1,\ldots,y_n)$. Then from (7.14) it follows that

$$\mathrm{Var}(v) = \mathrm{Var}(b'Y) = b'\Sigma_y b \tag{7.17}$$

where $\Sigma_y$ is the covariance matrix in $Y$. The covariance between $u = a'X$ and $v = b'Y$ is then available as

$$\mathrm{Cov}(u,v) = E[(u - E(u))(v - E(v))] = a'E[(X - E(X))(Y - E(Y))']b = a'\Sigma_{x,y}b \tag{7.18}$$

where $\Sigma_{x,y}$ is the $m\times n$ matrix

$$\Sigma_{x,y} = E\left\{\begin{bmatrix}x_1 - E(x_1)\\ \vdots\\ x_m - E(x_m)\end{bmatrix}[y_1 - E(y_1),\ldots,y_n - E(y_n)]\right\} \tag{7.19}$$

and if $\mathrm{Cov}(v,u)$ is considered then we have

$$\mathrm{Cov}(v,u) = b'\Sigma_{x,y}'a. \tag{7.20}$$

When we have two linear forms in the same variables $x_1,\ldots,x_k$, that is, $u = a_1x_1 + \cdots + a_kx_k$ and $v = b_1x_1 + \cdots + b_kx_k$, then the covariance between $u$ and $v$ is available from (7.18) by putting $X = Y$, $m = n = k$, or

$$\mathrm{Cov}(u,v) = a'\Sigma b = b'\Sigma a. \tag{7.21}$$


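Example 7.13. Compute the (1) variance of $u$; (2) variance of $v$; (3) covariance between $u$ and $v$, where $u = 2x_1 - x_2$, $v = x_1 + x_2$, $\mathrm{Var}(x_1) = 1$, $\mathrm{Var}(x_2) = 2$, $\mathrm{Cov}(x_1,x_2) = 1$.

Solution 7.13. (1)

$$\mathrm{Var}(u) = 4\,\mathrm{Var}(x_1) + \mathrm{Var}(x_2) - 4\,\mathrm{Cov}(x_1,x_2) = 4(1) + (2) - 4(1) = 2.$$

(2)

$$\mathrm{Var}(v) = \mathrm{Var}(x_1) + \mathrm{Var}(x_2) + 2\,\mathrm{Cov}(x_1,x_2) = (1) + (2) + 2(1) = 5.$$

(3) The covariance matrix in this case is

$$\Sigma = \begin{bmatrix}1 & 1\\ 1 & 2\end{bmatrix}$$

and hence the covariance between $u$ and $v$, by using the formula (7.21), is

$$\mathrm{Cov}(u,v) = a'\Sigma b = [2,-1]\begin{bmatrix}1 & 1\\ 1 & 2\end{bmatrix}\begin{bmatrix}1\\ 1\end{bmatrix} = 1.$$

This can also be computed as

$$\begin{aligned}
\mathrm{Cov}(u,v) &= \mathrm{Cov}(2x_1 - x_2, x_1 + x_2)\\
&= 2\,\mathrm{Var}(x_1) + 2\,\mathrm{Cov}(x_1,x_2) - \mathrm{Cov}(x_1,x_2) - \mathrm{Var}(x_2)\\
&= 2(1) + 2(1) - (1) - (2) = 1.
\end{aligned}$$

What are the answers if $\mathrm{Cov}(x_1,x_2) = 2$?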


Note 7.7. If the students have tried to answer the question at the end of Solution 7.13, then they may have noticed that it is not possible to have a covariance of 2 between $x_1$ and $x_2$ if the variances are 1 and 2. The reason is that the covariance matrix, or the variance-covariance matrix, has to be at least positive semi-definite when real. If we put $\mathrm{Cov}(x_1,x_2)$ in the above example as 2, then the diagonal elements are positive but the determinant is negative; the matrix is indefinite and hence it cannot be a covariance matrix.
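The quadratic and bilinear forms of Example 7.13, and the observation in Note 7.7, can be verified with a few lines of code. The sketch below uses plain nested lists; the helper names `bilinear` and `is_psd_2x2` are hypothetical. For a real symmetric $2\times 2$ matrix, positive semi-definiteness is equivalent to both diagonal entries and the determinant being non-negative, which is the test used here.

```python
def bilinear(a, S, b):
    """The scalar a' S b for coefficient lists a, b and a nested-list matrix S."""
    return sum(a[i] * S[i][j] * b[j]
               for i in range(len(a)) for j in range(len(b)))

def is_psd_2x2(S):
    """Positive semi-definiteness of a real symmetric 2x2 matrix."""
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    return S[0][0] >= 0 and S[1][1] >= 0 and det >= 0

a = [2, -1]            # u = 2*x1 - x2
b = [1, 1]             # v = x1 + x2
S = [[1, 1], [1, 2]]   # Var(x1) = 1, Var(x2) = 2, Cov(x1, x2) = 1

var_u = bilinear(a, S, a)
var_v = bilinear(b, S, b)
cov_uv = bilinear(a, S, b)
print(var_u, var_v, cov_uv)   # 2 5 1

# Note 7.7: with Cov(x1, x2) = 2 the matrix is no longer a covariance matrix.
S_bad = [[1, 2], [2, 2]]
print(is_psd_2x2(S), is_psd_2x2(S_bad))   # True False
print(bilinear(a, S_bad, a))              # -2, a negative "variance"
```

The negative value of $a'\Sigma a$ in the last line shows concretely why an indefinite matrix cannot serve as a covariance matrix.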


Sometimes the following notation is also used in the literature. 


Notation 7.5. When $X$ is a $p\times 1$ vector the notation $\mathrm{Cov}(X)$ means the variance-covariance matrix or covariance matrix in $X$, which we already denoted by $\Sigma = (\sigma_{ij})$.

Definition 7.11 (Correlation coefficient). A scale-free measure of covariance between two real scalar random variables $x$ and $y$ is called the correlation coefficient; it is denoted by $\rho$ or $\rho_{xy}$ and it is defined as

$$\rho = \frac{\mathrm{Cov}(x,y)}{\sqrt{\mathrm{Var}(x)\,\mathrm{Var}(y)}},\quad \mathrm{Var}(x) \ne 0,\ \mathrm{Var}(y) \ne 0. \tag{7.22}$$

When the covariance is divided by the standard deviations, the measure becomes scale-free, because covariance is available only in terms of the units of measurement of $x$ and $y$. Since covariance measures joint scatter in $x$ and $y$ or the scatter or dispersion in the point $(x,y)$, this correlation also can measure only the joint variation. Because of the word “correlation” people think that $\rho$ can measure relationship between $x$ and $y$. This is wrong. It cannot measure relationship between $x$ and $y$.


7.4.4 Some basic properties of the correlation coefficient 


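Note that the correlation coefficient is defined only for non-degenerate random variables.

Result 7.7. Whatever be the non-degenerate random variables $x$ and $y$,

$$-1 \le \rho \le 1. \tag{7.23}$$

The proof is trivial. Consider two new random variables $u = \frac{x}{\sigma_1} + \frac{y}{\sigma_2}$ and $v = \frac{x}{\sigma_1} - \frac{y}{\sigma_2}$ where $\sigma_1$ and $\sigma_2$ are the standard deviations of $x$ and $y$, respectively. Then

$$\begin{aligned}
\mathrm{Var}(u) &= \mathrm{Var}\Big(\frac{x}{\sigma_1}\Big) + \mathrm{Var}\Big(\frac{y}{\sigma_2}\Big) + 2\,\mathrm{Cov}\Big(\frac{x}{\sigma_1},\frac{y}{\sigma_2}\Big)\\
&= \frac{\mathrm{Var}(x)}{\sigma_1^2} + \frac{\mathrm{Var}(y)}{\sigma_2^2} + 2\frac{\mathrm{Cov}(x,y)}{\sigma_1\sigma_2} = 1 + 1 + 2\rho\\
&= 2(1 + \rho)
\end{aligned}$$

since $\mathrm{Cov}(x,y) = \rho\sigma_1\sigma_2$. But the variance of any real random variable is non-negative. Hence $2(1 + \rho) \ge 0 \Rightarrow \rho \ge -1$. Similarly, $\mathrm{Var}(v) = 2(1 - \rho) \ge 0 \Rightarrow \rho \le 1$, or $-1 \le \rho \le 1$.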


Result 7.8. Let $u = ax + b$, $a \ne 0$ and $v = cy + d$, $c \ne 0$ where $a, b, c, d$ are constants. Let $\rho_{xy}$ and $\rho_{uv}$ denote the correlation coefficients between $x$ and $y$, and $u$ and $v$, respectively. Then

$$\rho_{uv} = \frac{ac}{|ac|}\rho_{xy} = \pm\rho_{xy}.$$

Result 7.9.

$$\rho_{xy} = \pm 1 \text{ if and only if } y = ax + b,\ a \ne 0,$$

a linear function of $x$ almost surely.

Thus the only values of $\rho$ which can be given a physical interpretation are $\rho = \pm 1$, and no other point can be given a physical interpretation. Also,

$$|\rho| \le 1 \Rightarrow |\mathrm{Cov}(x,y)| \le \sigma_1\sigma_2$$

where $\sigma_1$ and $\sigma_2$ are the standard deviations. [This is also the Cauchy-Schwarz inequality.] Thus if the standard deviations are known then the covariance cannot be arbitrary; in absolute value it has to be less than or equal to the product of the standard deviations. This is an important point to remember in practical applications.
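Result 7.8 and the bound $|\mathrm{Cov}(x,y)| \le \sigma_1\sigma_2$ can be illustrated on a small discrete distribution. The joint probability function below is hypothetical and chosen only for illustration, as are the helper names `expect`, `moments` and `affine`. The sketch computes $\rho$ for $(x,y)$ and for $(ax+b, cy+d)$ with $ac > 0$ and with $ac < 0$.

```python
import math
from fractions import Fraction as F

# Hypothetical joint probability function, for illustration only.
f = {(0, 0): F(1, 4), (1, 0): F(1, 4), (1, 1): F(1, 4), (2, 1): F(1, 4)}

def expect(pf, psi):
    """E[psi(x, y)] under the joint probability function pf."""
    return sum(psi(x, y) * p for (x, y), p in pf.items())

def moments(pf):
    """Return (Cov(x, y), Var(x), Var(y)) as exact fractions."""
    ex = expect(pf, lambda x, y: x)
    ey = expect(pf, lambda x, y: y)
    cov = expect(pf, lambda x, y: (x - ex) * (y - ey))
    var_x = expect(pf, lambda x, y: (x - ex) ** 2)
    var_y = expect(pf, lambda x, y: (y - ey) ** 2)
    return cov, var_x, var_y

def corr(pf):
    """Correlation coefficient: Cov(x, y) / sqrt(Var(x) Var(y))."""
    cov, var_x, var_y = moments(pf)
    return float(cov) / math.sqrt(float(var_x * var_y))

def affine(pf, a, b, c, d):
    """Probability function of (u, v) = (a*x + b, c*y + d), a != 0, c != 0."""
    return {(a * x + b, c * y + d): p for (x, y), p in pf.items()}

rho_xy = corr(f)
rho_same = corr(affine(f, 2, 1, 3, -4))    # ac > 0: sign is kept
rho_flip = corr(affine(f, -2, 1, 3, -4))   # ac < 0: sign is reversed

cov, var_x, var_y = moments(f)
bound_holds = abs(float(cov)) <= math.sqrt(float(var_x * var_y))

print(rho_xy, rho_same, rho_flip, bound_holds)
```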


Exercises 7.4 


7.4.1. Let $x \sim N(0,1)$, a standard normal variable. Let $y = a + bx + cx^2$, $c \ne 0$ be a quadratic function of $x$. Compute the correlation between $x$ and $y$ here and write it as a function of $b$ and $c$. By selecting $b$ and $c$ show that, while a perfect mathematical relationship exists between $x$ and $y$, as given above, $\rho$ can be made zero, very small in $|\rho|$, or very large in $|\rho|$ (nearly $-1$ or $1$, but not equal to $\pm 1$). Thus it is meaningless to interpret the relationship between $x$ and $y$ based on the magnitude or sign of $\rho$.

7.4.2. By using Exercise 7.4.1 show that the following statements are incorrect: “$\rho > 0$ means increasing values of $x$ go with increasing values of $y$ or decreasing values of $x$ go with decreasing values of $y$”; “$\rho < 0$ means the increasing values of $x$ go with decreasing values of $y$ or vice versa”; “$\rho$ near to $1$ or $-1$ means near linearity between $x$ and $y$”.

7.4.3. Compute (1) covariance between $x$ and $y$; (2) $E[xy^2]$; (3) $\rho$ for the following discrete probability function:

$$f(0,-1) = \frac{1}{5},\quad f(0,1) = \frac{2}{5},\quad f(1,-1) = \frac{1}{5},\quad f(1,1) = \frac{1}{5}$$

and $f(x,y) = 0$ elsewhere.
7.4.4. Compute (1) Cov(x, y); (2) Ε[χ}γ”]; (3) p for the following density function: 
$$f(x,y) = 1,\quad 0 < x < 1,\ 0 < y < 1$$

and $f(x,y) = 0$ elsewhere.

7.4.5. Let $x_1,\ldots,x_n$ be a simple random sample of size $n$ from a gamma population with parameters $(\alpha,\beta)$. Let $\bar{x} = (x_1 + \cdots + x_n)/n$. (1) Compute the moment generating function of $\bar{x}$; (2) Show that $\sum_{j=1}^n x_j$ as well as $\bar{x}$ are gamma distributed. (3) Compute the moment generating function of $u = \frac{\bar{x} - E[\bar{x}]}{\sqrt{\mathrm{Var}(\bar{x})}}$.

7.4.6. (1) Show that $u$ in Exercise 7.4.5 is a re-located, re-scaled gamma random variable for every $n$. (2) Show also that when $n \to \infty$, $u$ goes to a standard normal variable.

7.4.7. Going for an interview consists of $t_1$ = time taken for travel to the venue, $t_2$ = waiting to be called for interview and $t_3$ = the actual interview time; thus the total time spent for the interview is $t = t_1 + t_2 + t_3$. It is known from previous experience that $t_1, t_2, t_3$ are independently gamma distributed with scale parameter $\beta = 2$ and $E[t_1] = 6$, $E[t_2] = 8$, $E[t_3] = 6$. What is the distribution of $t$? Work out its density.




7.4.8. Let xj,x; be iid random variables from a uniform population over [0,1]. Com- 
pute the following probabilities without computing the density of x, + x5; (1) Prix, + 
x, < 1}; (2) Pr(x < $}; (8) Prod exi < 1). 


7.4.9. If the real scalar variables $x$ and $y$ are independently distributed, are the following variables independently distributed? (1) $u = ax$ and $v = by$ where $a$ and $b$ are constants; (2) $u = ax + b$ and $v = cy + d$ where $a, b, c, d$ are constants.


7.5 Conditional expectations 


Conditional expectations are the expected values in the conditional distributions. 
In many of the applications such as model building, statistical prediction problems, 
Bayesian analysis, etc. conditional expectations play vital roles. Hence we will give a 
brief introduction to conditional expectations here. Two results which are basic will 
be listed first. 


Result 7.10. Whenever all the following expected values exist,

$$E[y] = E_x[E(y|x)] \tag{7.24}$$


and 


Var(y) = Var[E(y|x)] + E[Var(y|x)]. (7.25) 


In (7.24), the first expectation, E(y|x), is in the conditional space of y for all given x, 
and then this is treated as a function of x and then expectation is taken with respect 
to x. Thus the outside expectation is in the marginal space of x. But (7.25) says that the 
variance of any variable y is the sum of the expected value of a conditional variance 
and the variance of a conditional expectation under the assumption that there is a 
joint distribution of x and y and the expected values exist. 

Both (7.24) and (7.25) follow from the definition of expected values and conditional distributions. For the sake of illustration, (7.24) will be proved here for the continuous case. The proof of this for the discrete case and mixed cases and the proof of (7.25) are left as exercises to the students.

$$E(y|x) = \int_y y\,g(y|x)\,dy = \int_y y\,\frac{f(x,y)}{f_1(x)}\,dy$$

where $g(y|x)$ is the conditional density of $y$ given $x$, $f(x,y)$ is the joint density and $f_1(x)$ is the marginal density of $x$. Now, treating the above as a function of $x$, let us take the expected value in the marginal space of $x$. This is available by multiplying with the density of $x$ and integrating out. That is,

$$\begin{aligned}
E[E(y|x)] &= \int_x\Big[\int_y y\,\frac{f(x,y)}{f_1(x)}\,dy\Big]f_1(x)\,dx\\
&= \int_y y\Big[\int_x f(x,y)\,dx\Big]dy
\end{aligned}$$

after canceling the non-zero part of $f_1(x)$. Hence

$$E[E(y|x)] = \int_y y f_2(y)\,dy = E(y).$$

Note that when we integrate out $x$ from $f(x,y)$ we get the marginal density $f_2(y)$ of $y$. The proofs for the cases when both $x$ and $y$ are discrete, or one is discrete and the other continuous, are parallel, and hence left to the students. In computing the above expected values, we assumed the existence of the joint and marginal densities and the existence of the expected values.


Example 7.14. For the following joint density, 


1 1 2 
X, = —— α-20-2-3χ) 
fy) 5 


for -co«y < co, 0 < x x1and f(x, y) = 0 elsewhere, compute (1): E[y]; (2): Var(y). 


Solution 714. From the given statement, it is clear that x has a uniform distribution 
over [0,1] (available by integrating out y from the joint density) or with the density 


fo0-1 Osxsl 


and f(x) = 0 elsewhere. We can compute the mean value and variance of x from this 
marginal density. That is, 


1 
E[x] - | xdx = l and similarly Var(x) = EM 
0 2 12 


If we divide the given function by Π(χ), we should get the conditional density of y 
given x, denoted by g(y|x). That is, 


1 1 2 
g(b)- ——e 30799, -oo«y«oo 
γ2π 


or y|x ~ N(u - 24 3x, 0? = 1), that is, the conditional density of y given x is normal with 
mean value 2 + 3x and variance 1. Therefore, the conditional expectation of y given 
x or the conditional mean value, which is the expected value of y in the conditional 
density of y, given x, is 


E(y|x) 2 24 3x, (i) 


and the conditional variance or the variance in the conditional density, in this case is 
Var(y|x) = 1. 

Then the expected value of this conditional expectation (expectation is taken in 
the marginal space x) and the variance of this conditional expectation are respectively 


7.5 Conditional expectations —— 203 


3 7 


E[E(ype] = E[2 + 3x| - 24 Ex] -2* 2 7 7 (ii) 
Var[E(y|x)] = Var[2 + 3x] = 3? Var(x) = T - T (iii) 


In order to get the marginal density of $y$ from the joint density $f(x,y)$, one has to integrate out $x$. Here, the integral is over $[0,1]$ since $x$ is uniformly distributed there, and no closed analytic form for the marginal density of $y$ is available from our joint density. But we can compute $E(y)$ and $\mathrm{Var}(y)$ by using Result 7.10. From (ii) above,

$$E[y] = E[E(y|x)] = \frac{7}{2}$$

and, from (iii) and $\mathrm{Var}(y|x) = 1$,

$$\mathrm{Var}(y) = E[\mathrm{Var}(y|x)] + \mathrm{Var}[E(y|x)] = E[1] + \frac{3}{4} = \frac{7}{4}.$$
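The two values obtained in Example 7.14 can be checked by simulation. The following is only a sketch: the function name `simulate_example_7_14`, the sample size and the seed are arbitrary choices made here and are not part of the text. It draws $x$ from the uniform marginal, then $y$ from the conditional normal density, and compares the sample mean and variance of $y$ with $7/2$ and $7/4$.

```python
import random


def simulate_example_7_14(n_samples=200_000, seed=0):
    """Simulate y when x ~ Uniform[0, 1] and y | x ~ N(2 + 3x, 1).

    Returns the sample mean and the sample variance of y.
    """
    rng = random.Random(seed)
    ys = []
    for _ in range(n_samples):
        x = rng.random()                          # marginal density of x
        ys.append(rng.gauss(2.0 + 3.0 * x, 1.0))  # conditional density of y given x
    mean_y = sum(ys) / n_samples
    var_y = sum((y - mean_y) ** 2 for y in ys) / n_samples
    return mean_y, var_y


mean_y, var_y = simulate_example_7_14()
print(f"sample mean of y     = {mean_y:.3f}  (Result 7.10 gives 7/2)")
print(f"sample variance of y = {var_y:.3f}  (Result 7.10 gives 7/4)")
```

The simulated values agree with the exact ones only up to sampling error, which shrinks as the number of samples grows.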


Example 7.15. For the following joint density,

$$f(x,y) = \frac{1}{x^2}\,\frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(y-2-3x)^2}$$

for $-\infty < y < \infty$, $1 \le x < \infty$ and $f(x,y) = 0$ elsewhere, compute (1) $E(y)$; (2) $\mathrm{Var}(y)$.


Solution 7.15. The situation is similar to Example 7.14. Here, the marginal density of $x$ (available by integrating out $y$ from the joint density) is given by

$$f_1(x) = \frac{1}{x^2}, \quad 1 \le x < \infty$$

and zero elsewhere. Therefore, the conditional density of $y$ given $x$, which is $f(x,y)/f_1(x)$, is given by the normal density part, that is, $f(x,y)$ excluding the factor $\frac{1}{x^2}$. Therefore,

$$y|x \sim N(\mu = 2+3x, \sigma^2 = 1)$$

which gives

$$E(y|x) = 2+3x \quad\text{and}\quad \mathrm{Var}(y|x) = 1. \tag{i}$$

In order to compute $E(y)$ and $\mathrm{Var}(y)$ by using Result 7.10, we need to compute $E(x)$ and $\mathrm{Var}(x)$. But

$$E(x) = \int_1^\infty x\,\frac{1}{x^2}\,dx = \int_1^\infty \frac{1}{x}\,dx = [\ln x]_1^\infty = \infty.$$

Therefore, $E(x)$ does not exist, and hence Result 7.10 cannot be used to calculate $E(y)$ and $\mathrm{Var}(y)$.


Note 7.8. If $E(x)$ does not exist, then none of the higher moments $E(x^\alpha)$, $\alpha > 1$, will exist. More generally, if $E(x^m)$ does not exist but all moments of order lower than $m$ exist, then none of the moments $E(x^p)$, $p \ge m$, will exist.




7.5.1 Conditional expectation and prediction problem 


An important use of the conditional expectation is in the area of prediction. An agriculturist may want to predict the yield of tapioca under a certain chemical fertilizer and would like to answer a question such as: what will be the yield if 200 grams of that fertilizer is used? Here, let $y$ be the yield and $x$ be the amount of fertilizer used. Then the question is: what is the predicted yield $y$ if $x$ is fixed at 200, or given $x = 200$? As another example, someone may be trying to reduce weight by exercise. Here, $y$ is the reduction in weight and $x$ is the time spent on daily exercise. What is likely to be the reduction $y$ if the exercise is 30 minutes daily, or $x = 30$, $x$ being measured in minutes? As another example, suppose we want to look at the cost of living index. The cost of living for a household in a certain township depends on many items such as the per unit price of rice, say $x_1$, the per unit price of vegetables, $x_2$, the per unit price of milk, $x_3$, the transportation cost, $x_4$, etc. If the cost of living is denoted by $y$, then it depends on many variables, $x_1,\ldots,x_k$. What is the best function $g(x_1,\ldots,x_k)$ to predict $y$, where $g$ is some function and we want the best such function? We would like to use this function to predict $y$ at preassigned values of $x_1,\ldots,x_k$, that is, to answer a question such as: what is the cost of living if the price per kilogram of rice is Rs 20, the price per kilogram of vegetables is Rs 15, etc., or at preassigned values $x_1 = 20$, $x_2 = 15$, etc.?

Theoretically, any function of $x_1,\ldots,x_k$ can be used as a predictor, but our prediction may be far off from the true value. If an arbitrary function $g(x_1,\ldots,x_k)$, such as $g = 1 + x_1 + 2x_2$, is used to predict $y$, then the error $\epsilon$ in this prediction is $\epsilon = y - g(x_1,\ldots,x_k)$ or $g(x_1,\ldots,x_k) - y$. One criterion that one can use to come up with a good function as a predictor is to minimize the distance between $y$ and $g$. A mathematical distance between the random variable $y$ and $g(x_1 = a_1,\ldots,x_k = a_k)$, where $a_1,\ldots,a_k$ are the preassigned values of $x_1,\ldots,x_k$, is

$$\left\{E[y - g(x_1 = a_1,\ldots,x_k = a_k)]^2\right\}^{\frac{1}{2}}. \tag{i}$$


But minimization of (i) over all possible functions $g$ is equivalent to minimizing its square, that is, to minimizing

$$E[y - g]^2 \tag{7.26}$$

over all $g$. Since $g$ is evaluated at the given points $x_1 = a_1,\ldots,x_k = a_k$, it is equivalent to minimizing $E[y - a]^2$ for arbitrary $a$. This is already done in Chapter 3, and we have seen that the best value of $a$ is given by $a = E(y)$ at the given values of $x_1,\ldots,x_k$. Hence the best predictor is the conditional expectation of $y$ given $x_1,\ldots,x_k$:

$$\min_g E[y - g(x_1 = a_1,\ldots,x_k = a_k)]^2 \;\Rightarrow\; g = E[y|x_1,\ldots,x_k]. \tag{7.27}$$


Therefore, the conditional expectation of $y$ given $x_1,\ldots,x_k$ is the “best” predictor of $y$ at preassigned values of $x_1,\ldots,x_k$, best in the sense of minimizing the mean (expected value) square error, or in the minimum mean square sense. If other measures of “distance” are used, then we will come up with other rules (other functions). Since the mean squared error is a mathematically convenient form, (7.27) is taken as the “best” predictor.
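The step "minimizing $E[y-a]^2$ over $a$ gives $a = E(y)$" can be illustrated on a small discrete distribution. The distribution below (values 0, 1, 4 with probabilities 1/2, 1/4, 1/4) and the grid of candidate values of $a$ are hypothetical and chosen here only for illustration; the sketch uses exact fractions so that no rounding is involved.

```python
from fractions import Fraction


def mean_squared_error(values, probs, a):
    """E[(y - a)^2] for a discrete y taking values[i] with probability probs[i]."""
    return sum(p * (v - a) ** 2 for v, p in zip(values, probs))


# Hypothetical conditional distribution of y at some fixed values of the x's.
values = [0, 1, 4]
probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]

mean_y = sum(p * v for v, p in zip(values, probs))

# Candidate predictors a = 0, 1/4, 2/4, ..., 4.
candidates = [Fraction(i, 4) for i in range(17)]
best_a = min(candidates, key=lambda a: mean_squared_error(values, probs, a))

print(mean_y, best_a)  # both are 5/4
```

At $a = E(y)$ the mean squared error equals the variance of $y$; every other candidate on the grid gives a strictly larger value.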

Note that it may be possible to evaluate a conditional expectation if we have the 
joint distribution or at least the conditional distribution. In a practical situation, usu- 
ally neither the joint distribution nor the conditional distribution may be available. In 
that case, we may try to estimate the prediction function by imposing desirable con- 
ditions. 


Example 7.16. In villages across a state, it is found that the proportion $x_1$ of people having health problems and the proportion $x_2$ of people who are overweight (weight over the value prescribed by health standards) have a joint distribution given by the density

$$f(x_1,x_2) = x_1 + x_2, \quad 0 \le x_1 \le 1,\ 0 \le x_2 \le 1$$

and zero elsewhere. (1) Construct the best predictor of $x_1$, the proportion of people with health problems, by using $x_2$, the proportion of overweight people. (2) What is the predicted value of $x_1$ if a village selected at random has 30% of overweight people?


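Solution 7.16. We have already evaluated the marginal and conditional densities in this case. The conditional density of $x_1$ given $x_2$, denoted by $g_1(x_1|x_2)$, is given by

$$g_1(x_1|x_2) = \frac{x_1 + x_2}{x_2 + \frac{1}{2}}, \quad 0 \le x_1 \le 1$$

and zero elsewhere. Hence the conditional expectation

$$E(x_1|x_2) = \int_0^1 x_1\,\frac{x_1 + x_2}{x_2 + \frac{1}{2}}\,dx_1 = \frac{\frac{1}{3} + \frac{x_2}{2}}{x_2 + \frac{1}{2}}$$

is the best predictor of $x_1$ at preassigned values of $x_2$. (2) The best predicted value of $x_1$ at $x_2 = 0.3$ is then given by

$$\left.\frac{\frac{1}{3} + \frac{x_2}{2}}{x_2 + \frac{1}{2}}\right|_{x_2 = 0.3} = \frac{29}{48}.$$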
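The arithmetic of Solution 7.16 can be reproduced exactly with rational numbers. In the sketch below the function name `best_predictor_x1` is a label introduced here; the numerator is the integral of $x_1(x_1+x_2)$ over $[0,1]$ and the denominator is the marginal density of $x_2$, namely $x_2 + \frac{1}{2}$.

```python
from fractions import Fraction


def best_predictor_x1(x2):
    """E(x1 | x2) for the joint density f(x1, x2) = x1 + x2 on the unit square."""
    x2 = Fraction(x2)
    numerator = Fraction(1, 3) + x2 / 2   # integral of x1 * (x1 + x2) over 0 <= x1 <= 1
    denominator = x2 + Fraction(1, 2)     # marginal density of x2
    return numerator / denominator


print(best_predictor_x1(Fraction(3, 10)))  # 29/48
```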


Here, the joint density was available. Now, we will consider a case where we have only 
a conditional density. 


Example 7.17. The waiting time $t$ at a bus stop is known to be exponentially distributed, but the expected waiting time is a function of the delay $t_1$ due to traffic congestion on the way. The conditional density of $t$ given $t_1$ is known to be of the form

$$g(t|t_1) = \frac{1}{3+2t_1}\, e^{-\frac{t}{3+2t_1}}, \quad 0 \le t < \infty$$

and zero elsewhere. (1) Construct the best predictor function of $t_1$ for predicting $t$; (2) What is the best predicted $t$ if the traffic congestion delay is 5 minutes, time being measured in minutes?


Solution 7.17. For an exponential density, it was seen that the expected value is the parameter itself. Hence

$$E(t|t_1) = 3 + 2t_1$$

is the best predictor of $t$ at given values of $t_1$. For $t_1 = 5$, the best predicted value is $3 + 2(5) = 13$ minutes.


7.5.2 Regression 


The word “regression” means to regress or to go back. This name came in because the original problem considered was to say something about ancestors by studying the offspring. But nowadays, “regression” means the area of predicting one variable by using one or more other variables. We have already seen that if we use the criterion of minimum mean square error, then the conditional expectation is the best predictor function. Hence regression is defined as the conditional expectation.


Definition 7.12 (Regression of $y$ on $x_1,\ldots,x_k$). The regression of $y$ on $x_1,\ldots,x_k$ is the best predictor function for $y$, best in the minimum mean square sense, at preassigned values of $x_1,\ldots,x_k$, and it is defined as the conditional expectation of $y$ given $x_1,\ldots,x_k$, or $E[y|x_1,\ldots,x_k]$, which is a function of $x_1,\ldots,x_k$.


Regression analysis is an area which is often misused and misinterpreted in statistical analysis. Regression is often misinterpreted as model building by using the method of least squares. It is not a model building problem; it is a search for the best predictor. Since regression is defined as a conditional expectation, regression analysis is done in the conditional space; the whole joint space of all the variables is not necessary to do regression analysis. But for correlation analysis we need the whole space of joint distributions, and thus correlation analysis and regression analysis are not one and the same. As seen above, the best predictor or regression function can be constructed if either the joint distribution or the conditional distribution is available. We have done examples of both. If we have neither the joint distribution nor the conditional distribution, then the regression function cannot be evaluated. But sometimes we may have some idea about the conditional expectation, namely that at given values of $x_1,\ldots,x_k$ the expected value of $y$ has a certain functional form, such as a linear function of $x_1,\ldots,x_k$ or a polynomial type function, etc. In that case, we may try to estimate that regression function with the help of the method of least squares. This aspect will be considered in later chapters, and hence further discussion will not be attempted here.


Exercises 7.5 


7.5.1. Prove (7.24) for the discrete case and (7.25) for both discrete and continuous 
cases. 


7.5.2. For the joint density of $x$ and $y$,

$$f(x,y) = \frac{1}{1+x}\, e^{-\frac{y}{1+x}}, \quad 0 \le y < \infty,\ 0 \le x \le 1$$

and $f(x,y) = 0$ elsewhere, compute (1) $E(y)$; (2) $\mathrm{Var}(y|x)$; (3) $\mathrm{Var}(y)$; (4) the marginal density of $y$.


7.5.3. For the joint density 


2 ae 
fQy)= (xpi? i 

for $0 < y < \infty$, $1 < x < \infty$ and zero elsewhere, compute (1) $E(y)$; (2) $\mathrm{Var}(y|x)$; (3) $\mathrm{Var}(y)$; (4) the marginal density of $y$.


7.5.4. Construct an example of a joint density of $x$ and $y$ where $E(y|x) = 1 + x + 2x^2$ and (a) $E(y)$ exists but $E(y^2)$ does not exist; (b) $E(y)$ does not exist.


7.5.5. Construct the regression function of $x_1$ on $x_2, x_3$ and show that it is free of the regressed variables $x_2$ and $x_3$ in the following joint density. Why is it free of $x_2$ and $x_3$?


f(X1:X2:X3) = ce 1 aI 


for $0 < x_1, x_2, x_3 < \infty$ and zero elsewhere.


7.6 Bayesian procedure 


Another area which is based on the conditional distribution is that of Bayesian procedures and Bayesian inference. The name suggests that it has something to do with Bayes' theorem, which was considered in Chapter 2 and which had to do with inverse reasoning, or going from the effect to the cause. After observing an event, we ask about the cause for the occurrence of that event. Here, we look at the same Bayes' rule in terms of random variables. Let $f(x,y)$ be the joint density/probability function of two random variables. Let $f_1(x)$ and $f_2(y)$ be the marginal density/probability functions and $g_1(x|y)$ and $g_2(y|x)$ be the conditional density/probability functions. Then we have the following relations:

$$g_1(x|y) = \frac{f(x,y)}{f_2(y)} = \frac{g_2(y|x)\,f_1(x)}{f_2(y)}$$

$$g_2(y|x) = \frac{f(x,y)}{f_1(x)} = \frac{g_1(x|y)\,f_2(y)}{f_1(x)}. \tag{7.28}$$


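Let us interpret (7.28) in terms of one variable and one parameter. Let $y$ be a parameter $\theta$ in the probability/density function of a real scalar random variable $x$; for example, $x$ at a fixed $\theta$ may have an exponential density

$$g_1(x|\theta) = \theta e^{-\theta x}, \quad 0 \le x < \infty,\ \theta > 0$$

and $g_1(x|\theta) = 0$ elsewhere. But $\theta$ may have its own distribution. Suppose that $\theta$ has a gamma density with known shape parameter $\alpha$ and scale parameter $\beta$. Then

$$f_2(\theta) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}}\,\theta^{\alpha-1} e^{-\frac{\theta}{\beta}}.$$

Then (7.28) in this context becomes

$$g_2(\theta|x) = \frac{g_1(x|\theta)\,f_2(\theta)}{f_1(x)} = \frac{f(x,\theta)}{f_1(x)}. \tag{7.29}$$

How do we get the marginal density of $x$, namely $f_1(x)$, from the joint density $f(x,\theta)$ of $x$ and $\theta$? Simply integrate out or sum up the other variable, namely $\theta$. That is,

$$f_1(x) = \int_\theta f(x,\theta)\,d\theta = \int_\theta g_1(x|\theta)\,f_2(\theta)\,d\theta \quad\text{(continuous case)}$$

$$f_1(x) = \sum_\theta f(x,\theta) = \sum_\theta g_1(x|\theta)\,f_2(\theta) \quad\text{(discrete case)}.$$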


Here, $g_1(x|\theta)$ is the conditional probability/density function of $x$ given $\theta$ and $f_1(x)$ is the unconditional probability/density function of $x$. Thus (7.29) can be looked upon as a connection between the conditional and unconditional probability/density functions of $x$. As far as $\theta$ is concerned, $f_2(\theta)$ is the prior probability/density of $\theta$ whereas $g_2(\theta|x)$ is the posterior (in the sense of after observing $x$) probability/density function of $\theta$. Then what is the best predictor, best in the minimum mean square sense, of $\theta$ in the light of the given observation on $x$? We have seen from Section 7.5 that it is the conditional expectation. That is,

$$E(\theta|x) = \text{best predictor of } \theta \text{, given } x = \text{Bayes' predictor of } \theta.$$




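In the example of $x|\theta$ being an exponential variable and $\theta$ being a gamma variable, we have the following computations:

$$f(x,\theta) = g_1(x|\theta)\,f_2(\theta) = \theta e^{-\theta x}\,\frac{1}{\beta^{\alpha}\Gamma(\alpha)}\,\theta^{\alpha-1} e^{-\frac{\theta}{\beta}} = \frac{1}{\beta^{\alpha}\Gamma(\alpha)}\,\theta^{(\alpha+1)-1} e^{-\theta\left(x+\frac{1}{\beta}\right)}$$

$$f_1(x) = \int_\theta f(x,\theta)\,d\theta = \frac{1}{\beta^{\alpha}\Gamma(\alpha)}\int_0^\infty \theta^{(\alpha+1)-1} e^{-\theta\left(x+\frac{1}{\beta}\right)}\,d\theta = \frac{\Gamma(\alpha+1)\left(x+\frac{1}{\beta}\right)^{-(\alpha+1)}}{\beta^{\alpha}\Gamma(\alpha)}. \tag{7.30}$$

This is the unconditional density of $x$ in this case. The posterior density of $\theta$ is given by

$$g_2(\theta|x) = \frac{f(x,\theta)}{f_1(x)} = \frac{\left(x+\frac{1}{\beta}\right)^{\alpha+1}}{\Gamma(\alpha+1)}\,\theta^{(\alpha+1)-1} e^{-\theta\left(x+\frac{1}{\beta}\right)}$$

for $0 \le \theta < \infty$ and $g_2(\theta|x) = 0$ elsewhere.

What is the best predictor of $\theta$ in the presence of an observation on $x$, or at given $x$? It is the conditional expectation of $\theta$, given $x$. In the above example,

$$E(\theta|x) = \int_0^\infty \theta\,g_2(\theta|x)\,d\theta = \frac{\left(x+\frac{1}{\beta}\right)^{\alpha+1}}{\Gamma(\alpha+1)}\int_0^\infty \theta^{(\alpha+2)-1} e^{-\theta\left(x+\frac{1}{\beta}\right)}\,d\theta = \frac{\left(x+\frac{1}{\beta}\right)^{\alpha+1}}{\Gamma(\alpha+1)}\,\Gamma(\alpha+2)\left(x+\frac{1}{\beta}\right)^{-(\alpha+2)} = \frac{\alpha+1}{x+\frac{1}{\beta}}. \tag{7.31}$$

What is the mean value of $\theta$ before observing $x$? It is $E(\theta)$ from the prior density of $\theta$, which is $E(\theta) = \alpha\beta$. Thus the mean value $\alpha\beta$ is changed to $\frac{\alpha+1}{x+\frac{1}{\beta}}$ in the presence of an observation on $x$.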
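The Bayes' predictor for the exponential/gamma pair can be checked numerically. The sketch below compares the closed form $(\alpha+1)/(x+1/\beta)$ with a ratio of two midpoint-rule integrals of the unnormalized joint density $\theta^{\alpha}e^{-\theta(x+1/\beta)}$. The function names, the truncation point `upper`, the step count and the numerical values $\alpha = 2$, $\beta = 1.5$, $x = 0.8$ are all assumptions made here for illustration.

```python
import math


def posterior_mean(x, alpha, beta):
    """Closed form of E(theta | x): (alpha + 1) / (x + 1/beta)."""
    return (alpha + 1.0) / (x + 1.0 / beta)


def posterior_mean_numeric(x, alpha, beta, upper=60.0, steps=200_000):
    """Midpoint-rule value of  int theta f(x, theta) d theta / int f(x, theta) d theta.

    The constant 1 / (beta^alpha Gamma(alpha)) cancels in the ratio, so only
    theta^alpha * exp(-theta (x + 1/beta)) is evaluated.  The integral is
    truncated at `upper`, where the integrand is negligible for these values.
    """
    h = upper / steps
    num = den = 0.0
    for i in range(steps):
        theta = (i + 0.5) * h
        joint = theta ** alpha * math.exp(-theta * (x + 1.0 / beta))
        num += theta * joint
        den += joint
    return num / den


alpha, beta, x = 2.0, 1.5, 0.8
print("prior mean alpha*beta :", alpha * beta)
print("posterior mean (7.31) :", posterior_mean(x, alpha, beta))
print("numerical check       :", posterior_mean_numeric(x, alpha, beta))
```

For these illustrative values the prior mean is 3, and the observation $x = 0.8$ pulls the predictor down to about 2.05.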


Example 7.18. If a student is selected at random from a last year high school class or from the community of such last year high school classes, then the probability $p$ that he/she will answer a question correctly depends upon the background preparation, exposure to the topic, basic intelligence, etc. For one student, this probability $p$ may be 0.8, for another it may be 0.3, for another it may be 0.9, etc. This $p$ is a varying quantity; $p$ may have its own distribution. If a student is selected at random, then $p$ for this student is a fixed quantity. If 10 questions of similar types are asked, what is the chance that there will be $x$ correct answers, something like 8 correct answers? Assume that $p$ has a prior type-1 beta distribution with known parameters $\alpha$ and $\beta$.


Solution 7.18. Using the standard notation, the probability function of $x$ for a given $p$ is binomial:

$$g_1(x|p) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0,1,\ldots,n;\ 0 < p < 1$$

and zero elsewhere. In our example there are 10 questions, so $n = 10$, the probability of a correct answer is $p$, and the number of correct answers is $x = 8$. We assumed that $p$ has a prior type-1 beta density. Then the joint probability function of $x$ and $p$, denoted by $f(x,p)$, is given by

$$f(x,p) = g_1(x|p)\,f_2(p) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,p^{\alpha-1}(1-p)^{\beta-1}\binom{n}{x} p^x (1-p)^{n-x}$$

for $\alpha > 0$, $\beta > 0$, $x = 0,1,\ldots,n$; $0 < p < 1$ and zero elsewhere. Then the unconditional probability function of $x$, denoted by $f_1(x)$, is available as

$$f_1(x) = \int_p f(x,p)\,dp = \binom{n}{x}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\int_0^1 p^{\alpha+x-1}(1-p)^{\beta+n-x-1}\,dp = \binom{n}{x}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\frac{\Gamma(\alpha+x)\Gamma(\beta+n-x)}{\Gamma(\alpha+\beta+n)}.$$

Then what is the density of $p$ for given $x$, or the conditional density of $p$ for a given $x$? Let it be $g_2(p|x)$. Then

$$g_2(p|x) = \frac{f(x,p)}{f_1(x)} = \frac{\Gamma(\alpha+\beta+n)}{\Gamma(\alpha+x)\Gamma(\beta+n-x)}\,p^{\alpha+x-1}(1-p)^{\beta+n-x-1}, \quad 0 \le p \le 1.$$

What is the expected value of $p$ given $x$?

$$E(p|x) = \int_0^1 p\,g_2(p|x)\,dp = \frac{\Gamma(\alpha+\beta+n)}{\Gamma(\alpha+x)\Gamma(\beta+n-x)}\,\frac{\Gamma(\alpha+1+x)\Gamma(\beta+n-x)}{\Gamma(\alpha+\beta+1+n)} = \frac{\alpha+x}{\alpha+\beta+n}.$$

This is the Bayes' estimate of $p$, or the best predictor of $p$ at given $x$. If $p$ was fixed, then we would have estimated $p$ by the sample proportion, namely $\frac{x}{n}$. In the light of a prior type-1 beta distribution for $p$, the estimate has changed to $\frac{\alpha+x}{\alpha+\beta+n}$.

In the above example, what is the distinction between the two estimates for $p$? $\frac{x}{n}$ is the estimate for the probability of success for a given student. If one student is selected at random and she gave 7 correct answers out of 10 questions of similar difficulty, then $\frac{x}{n} = \frac{7}{10} = 0.7$ is the estimate for her probability of success. When $p$ has its own distribution, then we are considering the probability of success in the population of such final year students across the spectrum. What is the estimate of this $p$ across the spectrum, given the information that one girl from this spectrum gave correct answers for 7 out of 10 questions? Then the estimated value of $p$ is

$$\frac{\alpha+x}{\alpha+\beta+10} = \frac{1.5+7}{1.5+3.7+10} = \frac{8.5}{15.2}$$

if $\alpha = 1.5$ and $\beta = 3.7$. Note that the Bayes' estimate for $p$ here, $\frac{\alpha+x}{\alpha+\beta+n}$, can be smaller or bigger than $\frac{x}{n}$ depending upon the values of $\alpha$ and $\beta$.
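The Bayes' estimate $(\alpha+x)/(\alpha+\beta+n)$ is simple enough to evaluate exactly. The following sketch uses rational arithmetic; the function name `bayes_estimate_p` and the second prior $(\alpha, \beta) = (9, 1)$ are hypothetical choices added here to show an estimate that exceeds $x/n$.

```python
from fractions import Fraction


def bayes_estimate_p(x, n, alpha, beta):
    """Posterior mean of p for a binomial(n, p) count x and a type-1 beta(alpha, beta) prior."""
    return (alpha + x) / (alpha + beta + n)


x, n = 7, 10
sample_proportion = Fraction(x, n)

# Prior used in the text: alpha = 1.5, beta = 3.7.
estimate_text = bayes_estimate_p(x, n, Fraction(3, 2), Fraction(37, 10))

# A hypothetical prior concentrated near p = 1, chosen only for comparison.
estimate_other = bayes_estimate_p(x, n, Fraction(9), Fraction(1))

print(sample_proportion, estimate_text, estimate_other)  # 7/10 85/152 4/5
```

With the prior of the text the estimate $85/152 \approx 0.559$ falls below $0.7$, while the second prior pushes it above $0.7$.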


Exercises 7.6 


7.6.1. Let $x$ given $\lambda > 0$ be a Poisson random variable with parameter $\lambda$. Let $\lambda$ have a prior gamma distribution. Compute (1) the unconditional probability function of $x$; (2) the posterior density of $\lambda$ given $x = 3$; (3) Bayes' estimate of $\lambda$.


7.6.2. Let $x$ given $b$ be generalized gamma with density of the form

$$g_1(x|b) = c\,x^{\alpha-1} e^{-b x^{\delta}}, \quad x \ge 0,\ \delta > 0,\ \alpha > 0$$

where $c$ is the normalizing constant. Let $b$ have a gamma distribution. Then answer (1), (2), (3) of Exercise 7.6.1.


7.6.3. Let $x|\mu \sim N(\mu,1)$ and let $\mu \sim N(0,1)$. Answer (1), (2), (3) of Exercise 7.6.1.


7.6.4. Let x|a be uniformly distributed over [0,a]. Let a have a prior Pareto density $ . 
2«a«4 wherec is the normalizing constant. Answer (1), (2), (3) of Exercise 7.6.1. 


7.6.5. Let $x|p$ be binomial with parameters $(n = 10, p)$. Let $p$ have a prior power function density $f_2(p) = cp^2$, $0 < p < 0.7$, where $c$ is the normalizing constant. Answer (1), (2), (3) of Exercise 7.6.1.


7.7 Transformation of variables 


In Chapter 6, Section 6.8, we considered transformation of variables involving one variable: given the density $f(x)$ of a real random variable $x$, how to compute the density of $y = \phi(x)$, where $x$ to $y$ is a one to one function, that is, $x$ can be uniquely written as $x = \phi^{-1}(y)$. If $g(y)$ is the density of $y$, then we have seen that the relation is $f(x)dx = g(y)dy$ if $y$ is an increasing function of $x$ and $f(x)dx = -g(y)dy$ if $y$ is a decreasing function of $x$. In the discrete case, there is no Jacobian of transformation and $g(y)$, the probability function of $y$, can be computed by looking at the possible values $y$ can take and then computing the corresponding probabilities by using $f(x)$, the probability function of $x$.


Here, we will consider transformations when the real scalar random variables $x_1,\ldots,x_k$ have a joint distribution. Consider the transformation

$$\begin{aligned}
y_1 &= \phi_1(x_1,\ldots,x_k) = \phi_1(X)\\
y_2 &= \phi_2(x_1,\ldots,x_k) = \phi_2(X)\\
&\;\;\vdots\\
y_k &= \phi_k(x_1,\ldots,x_k) = \phi_k(X).
\end{aligned}$$

Let $Y' = (y_1,\ldots,y_k)$ and let the transformation be written as $Y = \phi(X)$. Then if the transformation $X$ to $Y$ is one to one, we can write $X$ uniquely in terms of $Y$ as $X = \phi^{-1}(Y)$. When all the variables $x_1,\ldots,x_k$ are discrete and there is a transformation (which need not be one to one), then the probability function $g(Y)$ of $Y$ can be computed parallel to the one variable case: look at all possible values $Y$ can take, then compute the corresponding probabilities by using the probability function $f(X)$ of $X$. This will give $g(Y)$. When all the variables $x_1,\ldots,x_k$, or $X$, are continuous and $X$ to $Y$ is a one to one transformation, then there is a Jacobian of the transformation and the connection between the density $f(X)$ of $X$ and $g(Y)$ of $Y$ is that

$$f(X)\,dX = g(Y)\,dY,$$

where

$$dX = dx_1 \wedge \cdots \wedge dx_k, \quad dY = dy_1 \wedge \cdots \wedge dy_k, \quad dY = J\,dX,$$

and $J$ is the determinant (taken in absolute value) of the matrix of partial derivatives $\left(\frac{\partial y_i}{\partial x_j}\right)$. Then

$$g(Y)\,dY = f(X)\,dX \;\Rightarrow\; g(Y)\,J\,dX = f(X)\,dX \;\Rightarrow\; g(Y) = \frac{1}{J}\,f(\phi^{-1}(Y)). \tag{7.32}$$

Let us do some simple examples to see the significance of the relationship in (7.32). Before taking up continuous cases, let us do one discrete case for the sake of illustration.


Example 7.19. Consider the transformation $(y_1 = x_1^2,\ y_2 = 2x_1 + x_2)$ and compute the joint probability function $g(y_1,y_2)$ when $x_1, x_2$ have the joint probability function

$$f(x_1,x_2) = \begin{cases}
1/10, & \text{for } (x_1 = 0, x_2 = 1)\\
2/10, & \text{for } (x_1 = 0, x_2 = 2)\\
2/10, & \text{for } (x_1 = -1, x_2 = 1)\\
5/10, & \text{for } (x_1 = -1, x_2 = 2)\\
0, & \text{elsewhere.}
\end{cases}$$


Solution 7.19. The possible values $y_1 = x_1^2$ can take are $y_1 = 0, 1$. The possible values $y_2$ can take are $y_2 = 2(0)+1 = 1$; $2(0)+2 = 2$; $2(-1)+1 = -1$; $2(-1)+2 = 0$. Hence we have $(y_1,y_2) = (0,1)$ with probability $\frac{1}{10}$, $(y_1,y_2) = (0,2)$ with probability $\frac{2}{10}$, and so on. No points here coincide, and hence the points are all distinct. [If some points coincided, then we should add up the corresponding probabilities.] Hence the joint probability function $g(y_1,y_2)$ is given as

$$g(y_1,y_2) = \begin{cases}
1/10, & \text{for } (y_1,y_2) = (0,1)\\
2/10, & \text{for } (y_1,y_2) = (0,2)\\
2/10, & \text{for } (y_1,y_2) = (1,-1)\\
5/10, & \text{for } (y_1,y_2) = (1,0)\\
0, & \text{elsewhere.}
\end{cases}$$
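The discrete procedure (list the image points, then add up the probabilities of points that coincide) can be written as a short routine. The helper name `transform_pmf` is an assumption of this sketch; the probability function is the one of Example 7.19, and the second call uses the non one to one map $(x_1,x_2) \mapsto x_1^2$ to show probabilities being added.

```python
from collections import defaultdict
from fractions import Fraction


def transform_pmf(pmf, transform):
    """Push a joint probability function through a (possibly many to one) map.

    pmf maps points (x1, x2) to probabilities; probabilities of points with
    the same image are added up.
    """
    out = defaultdict(Fraction)  # missing keys start at probability 0
    for point, prob in pmf.items():
        out[transform(*point)] += prob
    return dict(out)


f = {
    (0, 1): Fraction(1, 10),
    (0, 2): Fraction(2, 10),
    (-1, 1): Fraction(2, 10),
    (-1, 2): Fraction(5, 10),
}

g = transform_pmf(f, lambda x1, x2: (x1 ** 2, 2 * x1 + x2))
g_y1 = transform_pmf(f, lambda x1, x2: x1 ** 2)  # many to one: probabilities add

print(g)
print(g_y1)
```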


Example 7.20. Let $x_1$ and $x_2$ be independently distributed as uniform random variables over $[0,a]$ and $[0,b]$, respectively, that is, the densities of $x_1$ and $x_2$, denoted by $f_1(x_1)$ and $f_2(x_2)$, respectively, are $f_1(x_1) = \frac{1}{a}$, $0 \le x_1 \le a$ and $f_2(x_2) = \frac{1}{b}$, $0 \le x_2 \le b$, and zero elsewhere. Compute the densities of (1) $u = x_1 + x_2$; (2) $v = x_1x_2$; (3) $w = \frac{x_1}{x_2}$.


Solution 7.20. Due to the product probability property, or statistical independence, the joint density, denoted by $f(x_1,x_2)$, is the product, that is,

$$f(x_1,x_2) = \frac{1}{ab}, \quad 0 \le x_1 \le a,\ 0 \le x_2 \le b$$

and zero elsewhere. Let us make the transformation $y_1 = x_1 + x_2$, $y_2 = x_2$, so that it is a one to one transformation with $x_2 = y_2$, $x_1 = y_1 - y_2$. It is a linear transformation, and the matrix of the transformation is

$$\begin{bmatrix} 1 & 1\\ 0 & 1 \end{bmatrix},$$

whose determinant is 1. Hence if $g(y_1,y_2)$ is the joint density of $y_1$ and $y_2$, then it is the same as $\frac{1}{ab}$, but the region in the $(y_1,y_2)$-plane will be different. The region will be the one bounded by the lines $x_1 = 0 \Rightarrow y_1 - y_2 = 0$, $x_1 = a \Rightarrow y_1 - y_2 = a$, $x_2 = 0 \Rightarrow y_2 = 0$, $x_2 = b \Rightarrow y_2 = b$. Thus the rectangle in the $(x_1,x_2)$-plane transforms to a parallelogram in the $(y_1,y_2)$-plane, as shown in Figure 7.5.

Figure 7.5: Left: Region in the $(x_1,x_2)$-plane; Right: Region in the $(y_1,y_2)$-plane.

The marginal density of $y_1$ is available by integrating out $y_2$. From Figure 7.5, note that when $0 \le y_1 \le a$ the integration over $y_2$ is from 0 to $y_1$. That is,

$$\int_{y_2=0}^{y_1} \frac{1}{ab}\,dy_2 = \frac{y_1}{ab}.$$

When $b$ is greater than $a$, then in the interval $a \le y_1 \le b$ the integration is from $y_2 = y_1 - a$ to $y_1$. When $b \le y_1 \le a+b$, the integration is from $y_1 - a$ to $b$. That is,

$$\int_{y_2=y_1-a}^{y_1} \frac{1}{ab}\,dy_2 = \frac{a}{ab} = \frac{1}{b}$$

and

$$\int_{y_2=y_1-a}^{b} \frac{1}{ab}\,dy_2 = \frac{b+a-y_1}{ab}.$$

Thus the density of $y_1$, denoted by $g_1(y_1)$, for $b > a$, is given by

$$g_1(y_1) = \begin{cases}
\frac{y_1}{ab}, & 0 \le y_1 \le a\\
\frac{1}{b}, & a \le y_1 \le b\\
\frac{a+b-y_1}{ab}, & b \le y_1 \le a+b\\
0, & \text{elsewhere.}
\end{cases}$$

We can verify that it is a density by integrating out and showing that it gives 1. That is,

$$\int_0^{a+b} g_1(y_1)\,dy_1 = \frac{1}{ab}\left\{\int_0^a y_1\,dy_1 + a\int_a^b dy_1 + \int_b^{a+b}[(a+b) - y_1]\,dy_1\right\}$$

$$= \frac{1}{ab}\left\{\frac{a^2}{2} + a(b-a) + (a+b)a - \frac{1}{2}\left[(a+b)^2 - b^2\right]\right\} = \frac{ab}{ab} = 1.$$

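The three-piece density of $u = x_1 + x_2$ can be coded directly and its total integral checked numerically. This is a sketch under assumptions: the names `sum_density` and `integrate_midpoint` are introduced here, the values $a = 2$, $b = 5$ are arbitrary, and the case $a > b$ is handled by swapping the two lengths, which is legitimate because the sum is symmetric in $x_1$ and $x_2$.

```python
def sum_density(y, a, b):
    """Density of u = x1 + x2 with x1 ~ U[0, a], x2 ~ U[0, b], independent."""
    a, b = min(a, b), max(a, b)  # the formula in the text assumes b > a
    if 0.0 <= y <= a:
        return y / (a * b)
    if a < y <= b:
        return 1.0 / b
    if b < y <= a + b:
        return (a + b - y) / (a * b)
    return 0.0


def integrate_midpoint(func, lower, upper, steps=100_000):
    """Plain midpoint rule; adequate for this piecewise linear integrand."""
    h = (upper - lower) / steps
    return h * sum(func(lower + (i + 0.5) * h) for i in range(steps))


a, b = 2.0, 5.0
total = integrate_midpoint(lambda y: sum_density(y, a, b), 0.0, a + b)
print(total)  # close to 1
```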

Now, let us look at the density of $v = x_1x_2$. Again, let us use the same notations. Let $y_1 = x_1x_2$, $y_2 = x_2$, which means $x_2 = y_2$, $x_1 = \frac{y_1}{y_2}$. Then the Jacobian of the transformation is $\frac{1}{y_2}$. From Figure 7.6, observe that in the $(y_1,y_2)$-plane the integration over $y_2$ is to be done from $y_2 = \frac{y_1}{a}$ to $b$. The joint density of $x_1$ and $x_2$ is $f(x_1,x_2) = \frac{1}{ab}$ for $0 \le x_1 \le a$, $0 \le x_2 \le b$. The joint density of $y_1$ and $y_2$, denoted by $g(y_1,y_2)$, is then $g(y_1,y_2) = \frac{1}{ab\,y_2}$, including the Jacobian, and then the marginal density of $y_1$ is given by

$$g_1(y_1) = \frac{1}{ab}\int_{y_1/a}^{b} \frac{1}{y_2}\,dy_2 = \frac{1}{ab}\left[\ln y_2\right]_{y_1/a}^{b} = \frac{1}{ab}\left[\ln ab - \ln y_1\right], \quad 0 < y_1 \le ab$$

and zero elsewhere. Let us see whether it is a density function by checking whether the total integral is 1, since it is already non-negative.

Figure 7.6: Region in the $(y_1,y_2)$-plane.

$$\int_0^{ab} g_1(y_1)\,dy_1 = \frac{1}{ab}\int_0^{ab} (\ln ab - \ln y_1)\,dy_1 = \frac{1}{ab}\left\{(\ln ab)(ab) - \left[y_1\ln y_1 - y_1\right]_0^{ab}\right\} = \frac{ab}{ab} = 1.$$

Hence it is a density.

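Now we look at the density of $w = \frac{x_1}{x_2}$. Again, we use the same notations. Let $y_1 = \frac{x_1}{x_2}$, $y_2 = x_2$, which means $x_2 = y_2$, $x_1 = y_1y_2$. Then the Jacobian is $y_2$. Therefore, the joint density of $y_1$ and $y_2$ is

$$g(y_1,y_2) = \frac{y_2}{ab}.$$

The region in the $(y_1,y_2)$-plane is given in Figure 7.7. Then $x_2 = 0 \Rightarrow y_2 = 0$; $x_2 = b \Rightarrow y_2 = b$; $x_1 = 0 \Rightarrow y_1 = 0$; $x_1 = a \Rightarrow y_1y_2 = a$, which is a part of a hyperbola. Hence the integration over $y_2$ in the range $0 \le y_1 \le \frac{a}{b}$ is from 0 to $b$, the integration in the range $\frac{a}{b} \le y_1 < \infty$ is from 0 to $\frac{a}{y_1}$, and the Jacobian is $y_2$.

Figure 7.7: Region in the $(y_1,y_2)$-plane.

Therefore, the marginal density of $y_1$, denoted by $g_1(y_1)$, is given by

$$g_1(y_1) = \begin{cases}
\frac{1}{ab}\int_0^{b} y_2\,dy_2, & 0 \le y_1 \le \frac{a}{b}\\
\frac{1}{ab}\int_0^{a/y_1} y_2\,dy_2, & \frac{a}{b} \le y_1 < \infty\\
0, & \text{elsewhere}
\end{cases}
= \begin{cases}
\frac{1}{ab}\,\frac{b^2}{2}, & 0 \le y_1 \le \frac{a}{b}\\
\frac{1}{ab}\,\frac{a^2}{2y_1^2}, & \frac{a}{b} \le y_1 < \infty\\
0, & \text{elsewhere.}
\end{cases}$$

Let us see whether it is a density. The total integral is given by

$$\int_0^\infty g_1(y_1)\,dy_1 = \frac{1}{ab}\left[\int_0^{a/b} \frac{b^2}{2}\,dy_1 + \int_{a/b}^\infty \frac{a^2}{2y_1^2}\,dy_1\right] = \frac{1}{ab}\left[\frac{b^2}{2}\,\frac{a}{b} + \frac{a^2}{2}\left[-\frac{1}{y_1}\right]_{a/b}^{\infty}\right] = \frac{1}{ab}\left[\frac{ab}{2} + \frac{ab}{2}\right] = 1.$$

Hence the result.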
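The product and ratio densities can also be checked against simulated data. Integrating the two densities above gives the distribution functions $P(v \le c) = \frac{c}{ab}\left[1 + \ln\frac{ab}{c}\right]$ for $0 < c \le ab$, and $P(w \le c) = \frac{bc}{2a}$ for $c \le \frac{a}{b}$, $P(w \le c) = 1 - \frac{a}{2bc}$ for $c \ge \frac{a}{b}$; these closed forms are derived here for the check and are not stated in the text. The function names, the seed, the sample size and the values $a = 2$, $b = 5$ are assumptions of this sketch.

```python
import math
import random


def product_cdf(c, a, b):
    """P(x1 * x2 <= c) for 0 < c <= a*b, from integrating the product density."""
    return (c / (a * b)) * (1.0 + math.log(a * b / c))


def ratio_cdf(c, a, b):
    """P(x1 / x2 <= c) for c > 0, from integrating the ratio density."""
    if c <= a / b:
        return b * c / (2.0 * a)
    return 1.0 - a / (2.0 * b * c)


def empirical_cdfs(c_prod, c_ratio, a, b, n=200_000, seed=1):
    """Fractions of simulated pairs with x1*x2 <= c_prod and x1/x2 <= c_ratio."""
    rng = random.Random(seed)
    hits_prod = hits_ratio = 0
    for _ in range(n):
        x1 = rng.uniform(0.0, a)
        x2 = rng.uniform(0.0, b)
        hits_prod += x1 * x2 <= c_prod
        hits_ratio += x1 <= c_ratio * x2  # same event as x1/x2 <= c_ratio, no division
    return hits_prod / n, hits_ratio / n


a, b = 2.0, 5.0
p_prod, p_ratio = empirical_cdfs(3.0, 1.0, a, b)
print(p_prod, product_cdf(3.0, a, b))
print(p_ratio, ratio_cdf(1.0, a, b))
```

The simulated and exact values should agree to about two decimal places at this sample size.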


In the example above, we have done three forms, namely the sum, product and 
ratio. The students are advised to go through the geometry of the transformation from 
Figures 7.5, 7.6 and 7.7 so that the limits of integration are taken properly. Now there is 
only one more basic structure left, which is the density of the difference between two 
random variables. This will be illustrated by taking a simple example of an exponen- 
tial distribution. 


Example 7.21. Suppose that $x_1$ and $x_2$ are real scalar positive random variables, independently distributed as exponential with different parameters. Let the marginal densities be

$$f_i(x_i) = \frac{1}{\theta_i}\, e^{-\frac{x_i}{\theta_i}}, \quad x_i \ge 0,\ \theta_i > 0,\ i = 1,2$$

and zero elsewhere. Compute the densities of (1) $u = x_1 + x_2$; (2) $v = x_1 - x_2$.


Solution 7.21. The transformation of variables technique for a sum is already illustrated in Example 7.20. Now, we shall try to find the density of $u$ by using the moment generating function. Let the mgf of $x_i$ be denoted by $M_{x_i}(t_i)$, $i = 1,2$. Since the variables are assumed to be independently distributed, the mgf of the sum is the product of the mgf's. From straight integration, $M_{x_i}(t_i) = (1 - \theta_i t_i)^{-1}$. [This was evaluated for the gamma density already and the exponential density is a special case of the gamma density.] Hence the mgf of the sum $x_1 + x_2$ is given by

$$E[e^{t(x_1+x_2)}] = E[e^{tx_1}]\,E[e^{tx_2}] = (1-\theta_1 t)^{-1}(1-\theta_2 t)^{-1}. \tag{7.33}$$

But

$$\frac{1}{(1-\theta_1 t)(1-\theta_2 t)} = \frac{\theta_1}{\theta_1-\theta_2}\,\frac{1}{1-\theta_1 t} + \frac{\theta_2}{\theta_2-\theta_1}\,\frac{1}{1-\theta_2 t} \tag{7.34}$$

by using the partial fraction technique, when $\theta_1 \ne \theta_2$. If $\theta_1 = \theta_2 = \theta$, then (7.33) reduces to the mgf of a gamma random variable, and hence $u$ has a gamma density with the parameters $(\alpha = 2, \beta = \theta)$. When $\theta_1 \ne \theta_2$, the sum on the right side of (7.34) can be inverted because each term is a constant multiple of the mgf of an exponential variable. Hence the density of $u$, denoted by $g_1(u)$, is given by

$$g_1(u) = \frac{1}{\theta_1-\theta_2}\, e^{-\frac{u}{\theta_1}} + \frac{1}{\theta_2-\theta_1}\, e^{-\frac{u}{\theta_2}},$$

for $u \ge 0$, $\theta_i > 0$, $i = 1,2$, $\theta_1 \ne \theta_2$ and zero elsewhere. [The student may verify this result by using transformation of variables as done in Example 7.20.]
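As a numerical sanity check on the density of $u = x_1 + x_2$, the sketch below integrates $g_1(u)$ and $u\,g_1(u)$ by the midpoint rule. The total should be 1 and the mean should be $\theta_1 + \theta_2$, since each exponential variable has mean $\theta_i$. The function names, the truncation point `upper` and the values $\theta_1 = 2$, $\theta_2 = 0.5$ are illustrative assumptions.

```python
import math


def sum_density(u, theta1, theta2):
    """Density of u = x1 + x2 for independent exponentials with means theta1 != theta2."""
    if u <= 0.0:
        return 0.0
    return (math.exp(-u / theta1) - math.exp(-u / theta2)) / (theta1 - theta2)


def integrate_midpoint(func, lower, upper, steps=200_000):
    """Plain midpoint rule on [lower, upper]."""
    h = (upper - lower) / steps
    return h * sum(func(lower + (i + 0.5) * h) for i in range(steps))


theta1, theta2 = 2.0, 0.5
upper = 80.0  # the tail beyond this point is negligible for these parameters

total = integrate_midpoint(lambda u: sum_density(u, theta1, theta2), 0.0, upper)
mean = integrate_midpoint(lambda u: u * sum_density(u, theta1, theta2), 0.0, upper)
print(total, mean)  # close to 1 and to theta1 + theta2 = 2.5
```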

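Now, we shall look at the density of $v = x_1 - x_2$. In the $(x_1,x_2)$-plane the non-zero part of the density is defined in the first quadrant, $\{(x_1,x_2) \mid 0 \le x_1 < \infty,\ 0 \le x_2 < \infty\}$. Let us use transformation of variables. Let $y_1 = x_1 - x_2$, $y_2 = x_2$, so that $x_1 = y_1 + y_2$ and the Jacobian is 1. The joint density of $y_1$ and $y_2$, denoted by $g(y_1,y_2)$, is given by

$$g(y_1,y_2) = \frac{1}{\theta_1\theta_2}\, e^{-\frac{1}{\theta_1}(y_1+y_2) - \frac{1}{\theta_2}y_2}.$$

Now let us look at the region in the $(y_1,y_2)$-plane into which the first quadrant of the $(x_1,x_2)$-plane is mapped. $x_2 = 0 \Rightarrow y_2 = 0$; $x_2 \to \infty \Rightarrow y_2 \to \infty$, which gives the region above the $y_1$-axis. $x_1 = 0 \Rightarrow y_2 = -y_1$; $x_1 \to \infty \Rightarrow y_1 + y_2 \to \infty$, and hence the region of integration is what is shown in Figure 7.8.

Figure 7.8: Region of integration.

Hence when $y_1 > 0$, $y_2$ is to be integrated out from zero to infinity, and when $y_1 < 0$, $y_2$ is to be integrated out from $-y_1$ to infinity. If the marginal density of $y_1$ is denoted by $g_1(y_1)$, then

$$g_1(y_1) = \frac{1}{\theta_1\theta_2}\, e^{-\frac{y_1}{\theta_1}}\int_{y_2} e^{-\left(\frac{1}{\theta_1}+\frac{1}{\theta_2}\right)y_2}\,dy_2
= \begin{cases}
\frac{e^{-y_1/\theta_1}}{\theta_1+\theta_2}\left[-e^{-\left(\frac{\theta_1+\theta_2}{\theta_1\theta_2}\right)y_2}\right]_{0}^{\infty}, & 0 \le y_1 < \infty\\
\frac{e^{-y_1/\theta_1}}{\theta_1+\theta_2}\left[-e^{-\left(\frac{\theta_1+\theta_2}{\theta_1\theta_2}\right)y_2}\right]_{-y_1}^{\infty}, & -\infty < y_1 < 0
\end{cases}$$

$$= \begin{cases}
\frac{1}{\theta_1+\theta_2}\, e^{-\frac{y_1}{\theta_1}}, & 0 \le y_1 < \infty\\
\frac{1}{\theta_1+\theta_2}\, e^{\frac{y_1}{\theta_2}}, & -\infty < y_1 < 0
\end{cases} \tag{7.35}$$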


and zero elsewhere. It is easily verified that (7.35) is a density. 
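A simulation gives a quick check of (7.35). Integrating the positive branch from $c \ge 0$ to infinity gives $P(v > c) = \frac{\theta_1}{\theta_1+\theta_2}\,e^{-c/\theta_1}$; this tail formula is derived here for the check and is not stated in the text. The function names, the seed and the parameter values are assumptions of this sketch. Note that `random.expovariate` takes the rate, that is, $1/\theta_i$.

```python
import math
import random


def difference_density(v, theta1, theta2):
    """Density (7.35) of v = x1 - x2 for independent exponentials with means theta1, theta2."""
    if v >= 0.0:
        return math.exp(-v / theta1) / (theta1 + theta2)
    return math.exp(v / theta2) / (theta1 + theta2)


def upper_tail(c, theta1, theta2):
    """P(v > c) for c >= 0, obtained by integrating the positive branch of (7.35)."""
    return theta1 / (theta1 + theta2) * math.exp(-c / theta1)


def simulate_tail(c, theta1, theta2, n=200_000, seed=2):
    """Fraction of simulated differences x1 - x2 that exceed c."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x1 = rng.expovariate(1.0 / theta1)  # rate = 1 / mean
        x2 = rng.expovariate(1.0 / theta2)
        hits += (x1 - x2) > c
    return hits / n


theta1, theta2 = 2.0, 0.5
print(simulate_tail(0.0, theta1, theta2), upper_tail(0.0, theta1, theta2))
print(simulate_tail(1.0, theta1, theta2), upper_tail(1.0, theta1, theta2))
```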


Exercises 7.7 


7.7.1. Use the transformation of variables technique to show that the density of $u = x_1 + x_2$ is the same as the one obtained by the partial fraction technique in Example 7.21.


7.7.2. Verify that (7.35) is a density. 


7.7.3. If $x_1$ and $x_2$ are independently distributed type-1 beta random variables with different parameters, then evaluate the densities of (1): $u = x_1x_2$; (2): $v = \frac{x_1}{x_2}$.


7.7.4. Evaluate the densities of $u$ and $v$ in Exercise 7.7.3 by using the following technique: take the Mellin transform and then take the inverse Mellin transform to get the result. For example, the Mellin transform of the unknown density $g(u)$ of $u$ is available from $E[u^{s-1}] = E[x_1^{s-1}]\,E[x_2^{s-1}]$ due to statistical independence, and these individual expected values are available from the corresponding type-1 beta densities. Then take the inverse Mellin transform.


7.7.5. Let $x_1$ and $x_2$ be independently distributed gamma random variables with the parameters $(\alpha_1, \beta)$ and $(\alpha_2, \beta)$, with the same $\beta$. By using transformation of variables, show that $u = \frac{x_1}{x_1+x_2}$ is type-1 beta distributed, $v = \frac{x_1}{x_2}$ is type-2 beta distributed, and $w = x_1 + x_2$ is gamma distributed. [Hint: Use the transformation $x_1 = r\cos^2\theta$, $x_2 = r\sin^2\theta$. Then $J = 2r\cos\theta\sin\theta$.]


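7.7.6. Prove that $\Gamma(\frac{1}{2}) = \sqrt{\pi}$. Hint: Consider

$$\left[\Gamma\left(\tfrac{1}{2}\right)\right]^2 = \Gamma\left(\tfrac{1}{2}\right)\Gamma\left(\tfrac{1}{2}\right) = \int_0^\infty\!\!\int_0^\infty x^{\frac{1}{2}-1} y^{\frac{1}{2}-1} e^{-(x+y)}\,dx\,dy$$

and make the transformation $x = r\cos^2\theta$, $y = r\sin^2\theta$.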


7.7.7. Show that

$$\int_0^1 x^{\alpha-1}(1-x)^{\beta-1}\,dx = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$$

for $\Re(\alpha) > 0$, $\Re(\beta) > 0$. Hint: Start with $\Gamma(\alpha)\Gamma(\beta)$ and use integral representations.


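7.7.8. Let $u = \frac{x_1}{x_2}$ where $x_1$ and $x_2$ are independently distributed with $x_1 = \chi^2_m/m$ and $x_2 = \chi^2_n/n$. Here, $\chi^2_\nu$ means a chi-square variable with $\nu$ degrees of freedom. Show that $u$ is $F$-distributed, or $u$ has an $F$-density of the form

$$f(F) = \frac{\Gamma\left(\frac{m+n}{2}\right)}{\Gamma\left(\frac{m}{2}\right)\Gamma\left(\frac{n}{2}\right)}\left(\frac{m}{n}\right)^{\frac{m}{2}} \frac{F^{\frac{m}{2}-1}}{\left(1+\frac{m}{n}F\right)^{\frac{m+n}{2}}}$$

for $0 \le F < \infty$, $m, n = 1,2,\ldots$ and zero elsewhere.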


7.7.9. In Exercise 7.7.8, show that $x = \frac{m}{n}F$ has a type-2 beta distribution with the parameters $\frac{m}{2}$ and $\frac{n}{2}$.


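7.7.10. Let $u = \frac{x_1}{\sqrt{x_2}}$ where $x_1$ and $x_2$ are independently distributed with $x_1 \sim N(0,1)$ and $x_2 = \chi^2_\nu/\nu$. That is, $x_1$ is standard normal and $x_2$ is a chi-square with $\nu$ degrees of freedom, divided by its degrees of freedom. Show that $u$ is Student-t distributed with the density

$$f(u) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\pi\nu}\,\Gamma\left(\frac{\nu}{2}\right)}\left(1+\frac{u^2}{\nu}\right)^{-\frac{\nu+1}{2}}$$

for $-\infty < u < \infty$, $\nu = 1,2,\ldots$.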


A note on degrees of freedom 

In general, k “degrees of freedom” means free to vary in k different directions. The 
phrase “degrees of freedom” appears in different disciplines under different contexts, 
each having its own interpretation. We will indicate how it is interpreted in statistical 
literature. 

The moment generating function (mgf) of a real gamma variable $x$ with the parameters $(\alpha, \beta)$ is

$$M_x(t) = (1-\beta t)^{-\alpha}.$$

A chi-square variable with $m$ degrees of freedom, $\chi^2_m$, being a real gamma variable with $\beta = 2$, $\alpha = \frac{m}{2}$, has the mgf

$$M_{\chi^2_m}(t) = (1-2t)^{-\frac{m}{2}}.$$

Hence, if $\chi^2_m$ and $\chi^2_n$ are independently distributed, then $u = \chi^2_m + \chi^2_n$ has the mgf

$$M_u(t) = (1-2t)^{-\frac{m}{2}}(1-2t)^{-\frac{n}{2}} = (1-2t)^{-\frac{m+n}{2}}$$

which is the mgf of a chi-square with $m+n$ degrees of freedom. Hence when $\chi^2_m$ and $\chi^2_n$ are independently distributed, then

$$\chi^2_m + \chi^2_n = \chi^2_{m+n}.$$

Extending this result, we have

$$\chi^2_m = u_1 + \cdots + u_m$$

where $u_i = \chi^2_1$, or a chi-square with one degree of freedom, and $u_1,\ldots,u_m$ are independently distributed. But we have seen that when $x_i \sim N(0,1)$ then $x_i^2 \sim \chi^2_1$, a chi-square with one degree of freedom. Hence

$$\chi^2_m = x_1^2 + \cdots + x_m^2, \tag{7.36}$$

where $x_i^2$ is the square of a standard normal variable and $x_1,\ldots,x_m$ are independently distributed. Hence “$m$ degrees of freedom” here means that the $\chi^2_m$ can be written as the sum of squares of $m$ independently distributed standard normal variables.
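The representation (7.36) can be illustrated by simulation: add up the squares of $m$ independent standard normal draws and compare the sample mean and variance with those of a gamma variable with $\alpha = \frac{m}{2}$, $\beta = 2$, namely $\alpha\beta = m$ and $\alpha\beta^2 = 2m$. The function name, the seed, the sample size and the choice $m = 5$ are assumptions of this sketch.

```python
import random


def simulate_chi_square(m, n_samples=100_000, seed=3):
    """Sample mean and variance of x_1^2 + ... + x_m^2 with x_i independent N(0, 1)."""
    rng = random.Random(seed)
    draws = [
        sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(m))
        for _ in range(n_samples)
    ]
    mean = sum(draws) / n_samples
    var = sum((d - mean) ** 2 for d in draws) / n_samples
    return mean, var


m = 5
mean, var = simulate_chi_square(m)
print(f"sample mean     = {mean:.3f}  (gamma mean alpha*beta = {m})")
print(f"sample variance = {var:.3f}  (gamma variance alpha*beta^2 = {2 * m})")
```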


8 Some multivariate distributions 


8.1 Introduction 


There are several multivariate (involving more than one random variable) densities, 
where all the variables are continuous, as well as probability functions where all vari- 
ables are discrete. There are also mixed cases where some variables are continuous 
and others are discrete. 


8.2 Some multivariate discrete distributions 


Two such examples, where all the variables are discrete, are the multinomial probabil- 
ity law and the multivariate hypergeometric probability law. These will be considered 
first. 


8.2.1 Multinomial probability law 


In Bernoulli trials, each trial could result in only one of two events $A_1$ and $A_2$, with $A_1 \cup A_2 = S$, $A_1 \cap A_2 = \phi$, where $S$ is the sure event and $\phi$ is the impossible event. We called one of them success and the other failure. We could have also called both “successes” with probabilities $p_1$ and $p_2$, with $p_1 + p_2 = 1$. Now, we look at multinomial trials. Each trial can result in one of $k$ events $A_1,\ldots,A_k$ with $A_i \cap A_j = \phi$ for all $i \ne j$, $A_1 \cup \cdots \cup A_k = S$, the sure event. Let the probability of occurrence of $A_i$ be $p_i$. That is, $P(A_i) = p_i$, $i = 1,\ldots,k$, $p_1 + \cdots + p_k = 1$. Suppose that persons in a township are categorized into various age groups: less than or equal to 20 years old (group 1), more than 20 and less than or equal to 30 years old (group 2), 30 to 50 (group 3), over 50 (group 4). If a person is selected at random from this township, then she will belong to only one of these four groups, that is, each trial can result in one of $A_1, A_2, A_3, A_4$ with $A_i \cap A_j = \phi$, $i \ne j$ and $A_1 \cup \cdots \cup A_4 = S$ = sure event. Each such selection is a multinomial trial. If the selection is done independently, then we have independent multinomial trials.

As another example, suppose that the persons are categorized according to their 
monthly incomes into 10 distinct groups. Then a selected person will belong to one of 
these 10 groups. Here, k = 10 and in the first example k = 4. 

As another example, consider taking a hand of five cards with replacement. There
are four suits of 13 cards each (clubs, diamonds, hearts and spades). If cards are
selected at random with replacement, then $p_1 = \frac{13}{52} = \frac{1}{4} = p_2 = p_3 = p_4$.

Open Access. © 2018 Arak M. Mathai, Hans J. Haubold, published by De Gruyter. This work is licensed
under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
https://doi.org/10.1515/9783110562545-008

A general multinomial situation can be described as follows: Each trial results in
one of k mutually exclusive and totally exhaustive events $A_1, \ldots, A_k$ with the
probabilities $p_i = P(A_i)$, $i = 1, \ldots, k$, $p_1 + \cdots + p_k = 1$, and we consider n such independent
trials. What is the probability that the event $A_1$ occurs $x_1$ times, the event $A_2$ occurs
$x_2$ times, ..., the event $A_k$ occurs $x_k$ times, so that $x_1 + \cdots + x_k = n$ = the total number of
trials? We assume that the probabilities $(p_1, \ldots, p_k)$ remain the same from trial to trial
and the trials are independent. Let us denote the joint probability function of $x_1, \ldots, x_k$
by $f(x_1, \ldots, x_k)$. For any given sequence of $x_1$ times $A_1$, ..., $x_k$ times $A_k$, the probability is
$p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k}$. Hence the required probability is this probability times the total number
of such sequences. Note that n objects can be permuted among themselves in n! ways, $x_1$ like
objects in $x_1!$ ways and so on. Since repetitions are there, the total number of distinct sequences
possible is

$$\frac{n!}{x_1!\,x_2! \cdots x_k!}.$$

Therefore,

$$f(x_1, \ldots, x_k) = \frac{n!}{x_1! \cdots x_k!}\, p_1^{x_1} \cdots p_k^{x_k} \tag{8.1}$$

for $x_i = 0, 1, \ldots, n$; $0 < p_i < 1$, $i = 1, \ldots, k$; $x_1 + x_2 + \cdots + x_k = n$; $p_1 + \cdots + p_k = 1$; and zero
otherwise. Since there is a condition $x_1 + \cdots + x_k = n$, one of the variables can be written
in terms of others, and hence $f(x_1, \ldots, x_k)$ is a (k - 1)-variate probability law, not
k-variate. For example, for k = 2 we have

$$f(x_1) = \frac{n!}{x_1!\,x_2!}\, p_1^{x_1} p_2^{x_2} = \frac{n!}{x_1!\,(n - x_1)!}\, p_1^{x_1} (1 - p_1)^{n - x_1}$$


which is the binomial law. Note that the multinomial expansion gives

$$(p_1 + \cdots + p_k)^n = \sum_{x_1 + \cdots + x_k = n} \frac{n!}{x_1! \cdots x_k!}\, p_1^{x_1} \cdots p_k^{x_k}. \tag{8.2}$$
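The probability function (8.1) and the expansion (8.2) can be checked numerically. The sketch below is illustrative only: the helper names `multinomial_pmf`, `compositions` and `binomial_pmf` are assumptions made for this example. It evaluates (8.1), sums it over all admissible $(x_1, \ldots, x_k)$ to confirm that the total is 1, and compares the k = 2 case with the binomial law.

```python
from math import comb, factorial

def multinomial_pmf(xs, ps):
    """Probability (8.1) of the counts xs under the cell probabilities ps."""
    n = sum(xs)
    coef = factorial(n)
    for x in xs:
        coef //= factorial(x)  # exact: every partial quotient is an integer
    prob = float(coef)
    for x, p in zip(xs, ps):
        prob *= p ** x
    return prob

def compositions(n, k):
    """Yield every tuple of k non-negative integers that sums to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

def binomial_pmf(x, n, p):
    """Binomial probability, used to check the k = 2 special case."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

if __name__ == "__main__":
    ps = (0.2, 0.3, 0.5)
    # The sum over all (x1, x2, x3) with x1 + x2 + x3 = 4 should be 1 by (8.2).
    print(sum(multinomial_pmf(xs, ps) for xs in compositions(4, 3)))
```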
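What is the joint moment generating function for the multinomial probability law? It is

$$M(t_1, \ldots, t_{k-1}) = E\big[e^{t_1 x_1 + \cdots + t_{k-1} x_{k-1}}\big]$$

since there are only k - 1 variables, and it is

$$\begin{aligned}
&= \sum_{x_1 + \cdots + x_k = n} \frac{n!}{x_1! \cdots x_k!}\, (p_1 e^{t_1})^{x_1} \cdots (p_{k-1} e^{t_{k-1}})^{x_{k-1}}\, p_k^{x_k} \\
&= (p_1 e^{t_1} + \cdots + p_{k-1} e^{t_{k-1}} + p_k)^n
\end{aligned} \tag{8.3}$$

available from (8.2) by replacing $p_i$ by $p_i e^{t_i}$, $i = 1, \ldots, k - 1$ and $p_k$ remaining the same.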


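This mgf is differentiable as well as expansible. Hence we should get the integer
moments by differentiation.

$$\begin{aligned}
E(x_i) &= \frac{\partial}{\partial t_i} M(t_1, \ldots, t_{k-1}) \Big|_{t_1 = 0 = \cdots = t_{k-1}} \\
&= n p_i e^{t_i} (p_1 e^{t_1} + \cdots + p_{k-1} e^{t_{k-1}} + p_k)^{n-1} \Big|_{t_1 = 0 = \cdots = t_{k-1}} \\
&= n p_i (p_1 + \cdots + p_k)^{n-1} = n p_i, \quad i = 1, \ldots, k - 1.
\end{aligned}$$

But

$$E(x_k) = E[n - x_1 - \cdots - x_{k-1}] = n - n p_1 - \cdots - n p_{k-1} = n p_k.$$

Hence the formula holds for all $i = 1, \ldots, k$ or

$$E(x_i) = n p_i, \quad i = 1, \ldots, k. \tag{8.4}$$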


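For $i \ne j$,

$$\begin{aligned}
E(x_i x_j) &= \frac{\partial}{\partial t_i} \frac{\partial}{\partial t_j} M(t_1, \ldots, t_{k-1}) \Big|_{t_1 = 0 = \cdots = t_{k-1}} \\
&= n p_j e^{t_j} \frac{\partial}{\partial t_i} (p_1 e^{t_1} + \cdots + p_{k-1} e^{t_{k-1}} + p_k)^{n-1} \Big|_{t_1 = 0 = \cdots = t_{k-1}} \\
&= n p_i (n - 1) p_j.
\end{aligned}$$

Hence the covariance between $x_i$ and $x_j$ for $i \ne j$,

$$\begin{aligned}
\operatorname{Cov}(x_i, x_j) &= E(x_i x_j) - E(x_i) E(x_j) \\
&= n(n - 1) p_i p_j - (n p_i)(n p_j) = -n p_i p_j, \quad i \ne j = 1, \ldots, k.
\end{aligned} \tag{8.5}$$

$$\begin{aligned}
E(x_i^2) &= \frac{\partial^2}{\partial t_i^2} M(t_1, \ldots, t_{k-1}) \Big|_{t_1 = 0 = \cdots = t_{k-1}} \\
&= \frac{\partial}{\partial t_i}\, n p_i e^{t_i} (p_1 e^{t_1} + \cdots + p_{k-1} e^{t_{k-1}} + p_k)^{n-1} \Big|_{t_1 = 0 = \cdots = t_{k-1}} \\
&= n(n - 1) p_i^2 + n p_i.
\end{aligned}$$

Hence the variance of $x_i$ is given by

$$\operatorname{Var}(x_i) = n(n - 1) p_i^2 + n p_i - (n p_i)^2 = n p_i (1 - p_i), \quad i = 1, 2, \ldots, k. \tag{8.6}$$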


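For $i = 1, \ldots, k - 1$, they come from differentiation and for i = k by substitution. But
$\operatorname{Cov}(x_i, x_j) = -n p_i p_j$, $i \ne j = 1, \ldots, k$. Hence the covariance matrix for $x_1, \ldots, x_{k-1}$ will be
non-singular and positive definite but for $x_1, \ldots, x_k$ it will be positive semi-definite and
singular. The singular covariance matrix, denoted by Σ, is then given by

$$\Sigma = \operatorname{Cov}(X) = \operatorname{Cov}\begin{pmatrix} x_1 \\ \vdots \\ x_k \end{pmatrix}
= \begin{pmatrix}
n p_1 (1 - p_1) & -n p_1 p_2 & \cdots & -n p_1 p_k \\
-n p_2 p_1 & n p_2 (1 - p_2) & \cdots & -n p_2 p_k \\
\vdots & \vdots & \ddots & \vdots \\
-n p_k p_1 & -n p_k p_2 & \cdots & n p_k (1 - p_k)
\end{pmatrix}. \tag{8.7}$$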


This matrix Σ is a singular matrix of rank k - 1. 
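The moment formulas (8.4) to (8.6) and the singularity of Σ can be verified by brute force for a small case. The following sketch is an illustration under stated assumptions: the function names are hypothetical, and exact rational arithmetic (`fractions.Fraction`) is used so that the comparisons are exact. It enumerates the probability function (8.1), accumulates first and second moments, and returns the mean vector and covariance matrix. Each row of the resulting matrix sums to zero because $x_1 + \cdots + x_k = n$ is a constant, which is one way to see that Σ is singular.

```python
from fractions import Fraction
from math import factorial

def compositions(n, k):
    """Yield every tuple of k non-negative integers that sums to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

def multinomial_moments(n, ps):
    """Mean vector and covariance matrix by enumerating (8.1) exactly.

    ps is a list of Fractions that sums to 1.
    """
    k = len(ps)
    means = [Fraction(0)] * k
    second = [[Fraction(0)] * k for _ in range(k)]
    for xs in compositions(n, k):
        prob = Fraction(factorial(n))
        for x, p in zip(xs, ps):
            prob = prob * p ** x / factorial(x)
        for i in range(k):
            means[i] += xs[i] * prob
            for j in range(k):
                second[i][j] += xs[i] * xs[j] * prob
    cov = [[second[i][j] - means[i] * means[j] for j in range(k)]
           for i in range(k)]
    return means, cov

if __name__ == "__main__":
    ps = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
    means, cov = multinomial_moments(4, ps)
    print(means)                      # n * p_i for each i
    print([sum(row) for row in cov])  # every row sum is zero
```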




Example 8.1. A balanced die is rolled 10 times. What is the probability of getting 
5 ones, 3 twos, 2 sixes? 


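Solution 8.1. Since it is told that the die is balanced, we have a multinomial law with
k = 6, $p_1 = \cdots = p_6 = \frac{1}{6}$. Here n = 10, $x_1 = 5$, $x_2 = 3$,
$x_3 = 0 = x_4 = x_5$, $x_6 = 2$. Hence the required probability p is given by

$$\begin{aligned}
p &= \frac{n!}{x_1! \cdots x_6!}\, p_1^{x_1} \cdots p_6^{x_6} \\
&= \frac{10!}{5!\,3!\,0!\,0!\,0!\,2!} \Big(\frac{1}{6}\Big)^5 \Big(\frac{1}{6}\Big)^3 \Big(\frac{1}{6}\Big)^0 \Big(\frac{1}{6}\Big)^0 \Big(\frac{1}{6}\Big)^0 \Big(\frac{1}{6}\Big)^2 \\
&= 2520 \Big(\frac{1}{6}\Big)^{10} \approx 0.00004.
\end{aligned}$$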


Example 8.2. At Thekkady wild-life reserve, suppose that on any given day the prob- 
ability of finding a tourist from Kerala is 0.4, from Tamilnadu is 0.3, from other states 
in India is 0.2 and foreigners is 0.1. On a particular day, there are 20 tourists. What is 
the probability that out of these 20, 10 are from Tamilnadu and 10 are from other states 
in India? 


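Solution 8.2. We can take this as a multinomial situation with k = 4, n = 20, $p_1 = 0.4$,
$p_2 = 0.3$, $p_3 = 0.2$, $p_4 = 0.1$, $x_2 = 10$, $x_3 = 10$, $x_1 = 0 = x_4$. The required probability, p, is
then given by

$$\begin{aligned}
p &= \frac{n!}{x_1! \cdots x_4!}\, p_1^{x_1} \cdots p_4^{x_4} \\
&= \frac{20!}{0!\,10!\,10!\,0!}\, (0.4)^0 (0.3)^{10} (0.2)^{10} (0.1)^0 \\
&= (11)(13)(17)(19)(4)(0.3)^{10}(0.2)^{10} \\
&\approx 0.0000001.
\end{aligned}$$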
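The arithmetic in Examples 8.1 and 8.2 can be reproduced in a few lines. This is a minimal sketch; `multinomial_pmf` is an illustrative helper name, not something defined in the text. It evaluates the multinomial probability for the die (5 ones, 3 twos, 2 sixes in 10 rolls) and for the tourists (10 from Tamilnadu and 10 from other states out of 20).

```python
from math import factorial

def multinomial_pmf(xs, ps):
    """Multinomial probability of the counts xs under cell probabilities ps."""
    coef = factorial(sum(xs))
    for x in xs:
        coef //= factorial(x)  # exact integer multinomial coefficient
    prob = float(coef)
    for x, p in zip(xs, ps):
        prob *= p ** x
    return prob

# Example 8.1: balanced die rolled 10 times.
p_die = multinomial_pmf((5, 3, 0, 0, 0, 2), (1 / 6,) * 6)

# Example 8.2: 20 tourists; cells are Kerala, Tamilnadu, other states, foreigners.
p_tourists = multinomial_pmf((0, 10, 10, 0), (0.4, 0.3, 0.2, 0.1))

if __name__ == "__main__":
    print(p_die, p_tourists)
```

Both values are small: the first is about 4 in 100 000 and the second about 1 in 10 million, in line with the rounded answers given in the solutions.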


8.2.2 The multivariate hypergeometric probability law 


This law is applicable when sampling is done without replacement. A given trial may
result in one of k possible events but the trials are not independent. Suppose that there
are $a_1$ objects of one type, $a_2$ objects of a second type etc. and $a_k$ objects of the k-th
type. Suppose that these $a_1 + \cdots + a_k$ objects are well shuffled and a subset of n objects
is taken at random. At random, here means that every such subset of n is given an
equal chance of being selected. This experiment can also be done by picking one at a
time at random and without replacement. Both will lead to the same answer. The total
number of sample points possible is $\binom{a_1 + \cdots + a_k}{n}$. If we obtain $x_1$ of $a_1$ type, $x_2$ of $a_2$ type,
etc. and $x_k$ of $a_k$ type so that $x_1 + \cdots + x_k = n$ then the total number of sample points
favorable to the event of getting $x_1, \ldots, x_k$ is $\binom{a_1}{x_1}\binom{a_2}{x_2} \cdots \binom{a_k}{x_k}$. Hence the probability of
getting $x_1$ of $a_1$ type, ..., $x_k$ of $a_k$ type, denoted by $f(x_1, \ldots, x_k)$, is given by

$$f(x_1, \ldots, x_k) = \frac{\binom{a_1}{x_1} \cdots \binom{a_k}{x_k}}{\binom{a_1 + \cdots + a_k}{n}} \tag{8.8}$$

for $x_i = 0, 1, \ldots, n$ or $a_i$; $x_1 + \cdots + x_k = n$; $n = 1, 2, \ldots$; $i = 1, \ldots, k$ and zero elsewhere.
Note that it is a (k - 1)-variate probability function because there is one condition
that $x_1 + \cdots + x_k = n$ so that one variable can be written in terms of others. From
Chapter 3, we may note that

$$\sum_{x_1 + \cdots + x_k = n} f(x_1, \ldots, x_k) = 1$$

because

$$\sum_{x_1 + \cdots + x_k = n} \binom{a_1}{x_1} \cdots \binom{a_k}{x_k} = \binom{a_1 + \cdots + a_k}{n}.$$


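In this probability function, since factorials are appearing in the denominators,
factorial moments can be easily computed. Note that

$$\sum_{x_i = 0}^{a_i \text{ or } n} x_i \binom{a_i}{x_i} = \sum_{x_i = 1}^{a_i \text{ or } n} x_i \binom{a_i}{x_i} = a_i \sum_{x_i = 1}^{a_i \text{ or } n} \binom{a_i - 1}{x_i - 1}$$

since $x_i \binom{a_i}{x_i} = a_i \binom{a_i - 1}{x_i - 1}$ for $x_i \ge 1$. Therefore,

$$\begin{aligned}
E(x_i) &= \sum_{x_1 + \cdots + x_k = n} x_i\, \frac{\binom{a_1}{x_1} \cdots \binom{a_k}{x_k}}{\binom{a_1 + \cdots + a_k}{n}} \\
&= a_i\, \frac{\binom{a_1 + \cdots + a_k - 1}{n - 1}}{\binom{a_1 + \cdots + a_k}{n}} = \frac{n a_i}{a_1 + \cdots + a_k}.
\end{aligned}$$

Similarly,

$$E[x_i(x_i - 1)] = \frac{n(n - 1) a_i (a_i - 1)}{(a_1 + \cdots + a_k)(a_1 + \cdots + a_k - 1)} \Rightarrow$$

$$\begin{aligned}
\operatorname{Var}(x_i) &= E[x_i(x_i - 1)] + E[x_i] - [E(x_i)]^2 \\
&= \frac{n(n - 1) a_i (a_i - 1)}{(a_1 + \cdots + a_k)(a_1 + \cdots + a_k - 1)} + \frac{n a_i}{a_1 + \cdots + a_k} - \frac{n^2 a_i^2}{(a_1 + \cdots + a_k)^2}.
\end{aligned} \tag{8.9}$$

$$E[x_i x_j] = \frac{n(n - 1) a_i a_j}{(a_1 + \cdots + a_k)(a_1 + \cdots + a_k - 1)} \Rightarrow$$

$$\operatorname{Cov}(x_i, x_j) = \frac{n(n - 1) a_i a_j}{(a_1 + \cdots + a_k)(a_1 + \cdots + a_k - 1)} - \frac{n^2 a_i a_j}{(a_1 + \cdots + a_k)^2}. \tag{8.10}$$

The joint moment generating function goes into multiple series, and hence we will not
discuss it here. Also note that the variance does not have a nice representation compared
to the covariance expression in (8.10).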
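The mean, variance and covariance expressions for the multivariate hypergeometric law can be confirmed by direct enumeration on a small case. The sketch below is illustrative: the function names are assumptions for this example, and exact fractions are used so that the results can be compared with the closed forms without rounding.

```python
from fractions import Fraction
from itertools import product
from math import comb

def mv_hypergeom_pmf(xs, a):
    """Probability (8.8): xs[i] objects of type i when sum(xs) objects are drawn."""
    numerator = 1
    for x, a_i in zip(xs, a):
        numerator *= comb(a_i, x)  # comb returns 0 when x > a_i
    return Fraction(numerator, comb(sum(a), sum(xs)))

def mv_hypergeom_moments(n, a):
    """Total probability, means and covariance matrix by enumerating (8.8)."""
    k = len(a)
    total = Fraction(0)
    means = [Fraction(0)] * k
    second = [[Fraction(0)] * k for _ in range(k)]
    for xs in product(*(range(min(a_i, n) + 1) for a_i in a)):
        if sum(xs) != n:
            continue  # only points with x_1 + ... + x_k = n are admissible
        prob = mv_hypergeom_pmf(xs, a)
        total += prob
        for i in range(k):
            means[i] += xs[i] * prob
            for j in range(k):
                second[i][j] += xs[i] * xs[j] * prob
    cov = [[second[i][j] - means[i] * means[j] for j in range(k)]
           for i in range(k)]
    return total, means, cov

if __name__ == "__main__":
    total, means, cov = mv_hypergeom_moments(4, (3, 4, 5))
    print(total, means)
```

For a = (3, 4, 5) and n = 4, the means come out as $n a_i / (a_1 + a_2 + a_3)$, and the off-diagonal entries of the covariance matrix agree with (8.10).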




Example 8.3. From a well-shuffled deck of 52 playing cards, a hand of 8 cards is se- 
lected at random. What is the probability that this hand contains 3 clubs, 3 spades 
and 2 hearts? 


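Solution 8.3. This is a multivariate hypergeometric situation with k = 4, $a_1 = 13 = a_2 = a_3 = a_4$,
$x_1 = 3$, $x_2 = 3$, $x_3 = 2$, $x_4 = 0$. Hence the required probability is given
by

$$\begin{aligned}
\frac{\binom{13}{3}\binom{13}{3}\binom{13}{2}\binom{13}{0}}{\binom{52}{8}}
&= \frac{(1)(2)(3)(4)(5)(6)(7)(8)}{(52)(51)(50)(49)(48)(47)(46)(45)} \left[\frac{(13)(12)(11)}{(1)(2)(3)}\right]^2 \left[\frac{(13)(12)}{(1)(2)}\right] \\
&= \frac{(13)(13)(11)(11)(4)}{(47)(23)(5)(15)(17)(7)} \\
&= 0.008478.
\end{aligned}$$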
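The value in Solution 8.3 can be checked with exact binomial coefficients. The short sketch below is only an illustration; the variable names are arbitrary.

```python
from fractions import Fraction
from math import comb

# 3 clubs, 3 spades, 2 hearts and 0 diamonds in a hand of 8 cards.
numerator = comb(13, 3) * comb(13, 3) * comb(13, 2) * comb(13, 0)
probability = Fraction(numerator, comb(52, 8))

if __name__ == "__main__":
    print(probability, float(probability))
```

The decimal value agrees with 0.008478 to the six places quoted in the solution.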


Exercises 8.2 


8.2.1. In a factory, four machines are producing nuts of a certain diameter. These
machines also sometimes produce defective nuts (nuts which do not satisfy quality
specifications). Machine 1 is known to produce 40% of the defective nuts, machine 2,
30%, machine 3, 20% and machine 4, 10%. From a day's production, 5 nuts are selected
at random and 3 are defective. What is the probability that one defective came from
machine 1, and the other 2 from machine 2?


8.2.2. Cars on the roads in Kerala are known to be such that 40% are of Indian make, 
30% of Indo-Japanese make and 30% others. Out of the 10 cars which came to a toll 
booth at a particular time, what is the probability that s are Indo-Japanese and 4 are 
Indian make? 


8.2.3. A small township has households belonging to the following income groups, 
based on monthly incomes. (Income group, Number) = (<10000, 100), (10000 to 
20 000, 50), (over 30 000, 50). Four families are selected from this township, at ran- 
dom. What is the probability that two are in the group (10 000 to 20 000) and two are 
in the group (<10 000)? 


8.2.4. A class consists of students in the following age groups: (Age group, Number) =
(below 20, 10), (20 to 21, 15), (21 to 22, 20), (above 22, 5). A set of four students is
selected at random. What is the probability that there is one from each age
group?


8.2.5. In Exercise 8.2.4, what is the probability that at least one group has none in the 
selected set? 




8.3 Some multivariate densities 


There are many types of multivariate densities in current use in statistical literature.
The most commonly used ones are multivariate normal, type-1 Dirichlet, type-2 Dirichlet
and multivariate and matrix-variate gamma.

Corresponding to a univariate (one variable case) density, do we have something
called the unique multivariate density? For example, let x ~ N(0,1), standard normal,
and suppose that we have a bivariate density f(x, y) such that f(x, y) ≥ 0 for all x and y,
$\int_x \int_y f(x, y)\, dx \wedge dy = 1$, and $\int_x f(x, y)\, dx = f_2(y) \sim N(0,1)$, $\int_y f(x, y)\, dy = f_1(x) \sim N(0,1)$.
Is f(x, y) a unique function or can we have many such f(x, y) satisfying the above
conditions having the marginal densities as standard normal? The answer is that f(x, y)
is not unique: we can have many functions satisfying all the above conditions. Hence
there is nothing called the unique multivariate analogue of a given univariate density.
As two examples, we can give

$$f_1(x, y) = \frac{1}{2\pi}\, e^{-\frac{1}{2}(x^2 + y^2)} \quad \text{and} \quad
f_2(x, y) = \frac{1}{2\pi\sqrt{1 - \rho^2}}\, e^{-\frac{1}{2(1 - \rho^2)}(x^2 - 2\rho x y + y^2)},$$

for -1 < ρ < 1, which are two functions that are both multivariate analogues of a standard
normal density. Since there is nothing called a unique multivariate analogue to any
given univariate density, some functions are taken as multivariate analogues due to
some desirable properties. But the student must keep in mind that when we take a
particular multivariate density as an analogue of a given univariate density this function
is not taken as the unique multivariate analogue. It is only one multivariate analogue.
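The non-uniqueness can be seen numerically. The sketch below assumes the familiar bivariate normal form with correlation ρ and unit variances (the function names and the integration settings are illustrative choices, not from the text). It integrates out y by the trapezoidal rule and compares the result with the standard normal density for several values of ρ. Each ρ gives a different joint density, yet the marginal is the same.

```python
from math import exp, pi, sqrt

def bivariate_density(x, y, rho):
    """Bivariate normal density with zero means, unit variances, correlation rho."""
    q = (x * x - 2 * rho * x * y + y * y) / (1 - rho * rho)
    return exp(-0.5 * q) / (2 * pi * sqrt(1 - rho * rho))

def marginal_of_x(x, rho, lo=-10.0, hi=10.0, steps=4000):
    """Integrate the joint density over y with the trapezoidal rule."""
    h = (hi - lo) / steps
    total = 0.5 * (bivariate_density(x, lo, rho) + bivariate_density(x, hi, rho))
    for i in range(1, steps):
        total += bivariate_density(x, lo + i * h, rho)
    return total * h

def standard_normal_density(x):
    return exp(-0.5 * x * x) / sqrt(2 * pi)

if __name__ == "__main__":
    for rho in (0.0, 0.5, -0.8):
        print(rho, marginal_of_x(0.7, rho), standard_normal_density(0.7))
```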


8.3.1 Type-1 Dirichlet density


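This is one generalization of a type-1 beta density. The following is the form of the
density:

$$f(x_1, \ldots, x_k) = c\, x_1^{\alpha_1 - 1} x_2^{\alpha_2 - 1} \cdots x_k^{\alpha_k - 1} (1 - x_1 - \cdots - x_k)^{\alpha_{k+1} - 1}$$

for $\Re(\alpha_j) > 0$, $j = 1, \ldots, k + 1$; $(x_1, \ldots, x_k) \in \Omega$, $\Omega = \{(x_1, \ldots, x_k) \mid 0 \le x_i \le 1,\ i = 1, \ldots, k,\ x_1 + \cdots + x_k \le 1\}$,
and $f(x_1, \ldots, x_k) = 0$ otherwise. In statistical problems, usually the
parameters are real and then the conditions will be $\alpha_j > 0$, $j = 1, \ldots, k + 1$. Note that
for k = 1 we have type-1 beta density, and hence the above density can be taken as a
generalization of a type-1 beta density. By integrating out variables one at a time, we
can evaluate the normalizing constant c. For example, let

$$\begin{aligned}
I_{x_1} &= \int_{x_1 = 0}^{1 - x_2 - \cdots - x_k} x_1^{\alpha_1 - 1} [1 - x_1 - x_2 - \cdots - x_k]^{\alpha_{k+1} - 1}\, dx_1 \\
&= (1 - x_2 - \cdots - x_k)^{\alpha_{k+1} - 1} \int_{x_1 = 0}^{1 - x_2 - \cdots - x_k} x_1^{\alpha_1 - 1} \left[1 - \frac{x_1}{1 - x_2 - \cdots - x_k}\right]^{\alpha_{k+1} - 1} dx_1.
\end{aligned}$$

Put $y_1 = \frac{x_1}{1 - x_2 - \cdots - x_k}$; then

$$\begin{aligned}
I_{x_1} &= (1 - x_2 - \cdots - x_k)^{\alpha_1 + \alpha_{k+1} - 1} \int_0^1 y_1^{\alpha_1 - 1} (1 - y_1)^{\alpha_{k+1} - 1}\, dy_1 \\
&= (1 - x_2 - \cdots - x_k)^{\alpha_1 + \alpha_{k+1} - 1}\, \frac{\Gamma(\alpha_1)\Gamma(\alpha_{k+1})}{\Gamma(\alpha_1 + \alpha_{k+1})}
\end{aligned}$$

for $\Re(\alpha_1) > 0$, $\Re(\alpha_{k+1}) > 0$ or $\alpha_1 > 0$, $\alpha_{k+1} > 0$ if real. Proceeding like this, the final result
is the following: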


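$$\int_\Omega f(x_1, \ldots, x_k)\, dx_1 \wedge \cdots \wedge dx_k = c D_k,$$

where

$$D_k = D(\alpha_1, \ldots, \alpha_k; \alpha_{k+1}) = \frac{\Gamma(\alpha_1) \cdots \Gamma(\alpha_{k+1})}{\Gamma(\alpha_1 + \cdots + \alpha_{k+1})}. \tag{8.11}$$

Therefore, since the total integral is 1,

$$c = \frac{1}{D_k}.$$

The product moment $E[x_1^{h_1} \cdots x_k^{h_k}]$ is available from (8.11) by replacing $\alpha_j$ by $\alpha_j + h_j$,
$j = 1, \ldots, k$. That is,

$$\begin{aligned}
E[x_1^{h_1} \cdots x_k^{h_k}] &= \frac{D(\alpha_1 + h_1, \ldots, \alpha_k + h_k; \alpha_{k+1})}{D(\alpha_1, \ldots, \alpha_k; \alpha_{k+1})} \\
&= \frac{\Gamma(\alpha_1 + h_1)}{\Gamma(\alpha_1)} \cdots \frac{\Gamma(\alpha_k + h_k)}{\Gamma(\alpha_k)}
\times \frac{\Gamma(\alpha_1 + \cdots + \alpha_{k+1})}{\Gamma(\alpha_1 + h_1 + \cdots + \alpha_k + h_k + \alpha_{k+1})}
\end{aligned} \tag{8.12}$$

for $\Re(\alpha_j + h_j) > 0$, $j = 1, \ldots, k$. This means that if the $\alpha_j$'s are real then the moments will
exist for some negative values of $h_j$ also. Some basic properties of type-1 Dirichlet are
the following.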


Result 8.1. If $(x_1, \ldots, x_k)$ has a type-1 Dirichlet distribution, then every subset of
$(x_1, \ldots, x_k)$ is also type-1 Dirichlet distributed and the individual variables are type-1
beta distributed.


The proof follows by using the property that arbitrary product moments (8.12) will
uniquely determine the corresponding distributions. Retain $h_j$ for the variables in the
selected subset and put the remaining h's equal to zero and then identify the
corresponding distribution to show that all subsets have the same structure of the density
or all subsets are type-1 Dirichlet distributed.
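The product-moment formula (8.12) and the argument used for Result 8.1 can be tried out numerically. In the sketch below (the function names are illustrative assumptions), the product moment is computed through log-gamma functions; setting all h's except $h_1$ to zero gives the h-th moment of $x_1$, which matches the h-th moment of a type-1 beta variable with parameters $(\alpha_1, \alpha_2 + \cdots + \alpha_{k+1})$. Real parameters with $\alpha_j + h_j > 0$ are assumed.

```python
from math import exp, lgamma

def dirichlet_product_moment(alphas, hs):
    """E[x_1^{h_1} ... x_k^{h_k}] from (8.12).

    alphas = (alpha_1, ..., alpha_{k+1}), hs = (h_1, ..., h_k);
    requires alpha_j + h_j > 0 for j = 1, ..., k.
    """
    head = alphas[:-1]
    log_moment = 0.0
    for a, h in zip(head, hs):
        log_moment += lgamma(a + h) - lgamma(a)
    total = sum(alphas)
    log_moment += lgamma(total) - lgamma(total + sum(hs))
    return exp(log_moment)

def beta_moment(a, b, h):
    """h-th moment of a type-1 beta variable with parameters (a, b)."""
    return exp(lgamma(a + h) + lgamma(a + b) - lgamma(a) - lgamma(a + b + h))

if __name__ == "__main__":
    alphas = (2.0, 3.0, 4.0)          # k = 2
    print(dirichlet_product_moment(alphas, (1, 0)))    # E[x_1]
    print(dirichlet_product_moment(alphas, (1, 1)))    # E[x_1 x_2]
    print(dirichlet_product_moment(alphas, (-0.5, 0)),
          beta_moment(2.0, 7.0, -0.5))                 # a negative-order moment
```

Note that the moment of order h = -0.5 exists here because $\alpha_1 + h = 1.5 > 0$, which illustrates the remark that some negative-order moments exist.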


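Result 8.2. If $(x_1, \ldots, x_k)$ has a type-1 Dirichlet distribution, then $y_1 = 1 - x_1 - \cdots - x_k$
and $y_2 = x_1 + \cdots + x_k$ are both type-1 beta distributed.

For proving this, let us consider the h-th moment of $1 - x_1 - \cdots - x_k$, for an arbitrary h.

$$\begin{aligned}
E[1 - x_1 - \cdots - x_k]^h &= \frac{1}{D_k} \int_\Omega x_1^{\alpha_1 - 1} \cdots x_k^{\alpha_k - 1} (1 - x_1 - \cdots - x_k)^{\alpha_{k+1} + h - 1}\, dx_1 \wedge \cdots \wedge dx_k \\
&= \frac{\Gamma(\alpha_{k+1} + h)}{\Gamma(\alpha_{k+1})}\, \frac{\Gamma(\alpha_1 + \cdots + \alpha_{k+1})}{\Gamma(\alpha_1 + \cdots + \alpha_{k+1} + h)}
\end{aligned} \tag{8.13}$$

for $\Re(\alpha_{k+1} + h) > 0$. But (8.13) is the h-th moment of a type-1 beta random variable with
the parameters $(\alpha_{k+1}, \alpha_1 + \cdots + \alpha_k)$. Hence $y_1$ is type-1 beta distributed. For any type-1
beta variable z, 1 - z is also type-1 beta distributed with the parameters interchanged. Hence
$1 - y_1 = y_2$ is type-1 beta distributed with the parameters $(\alpha_1 + \cdots + \alpha_k, \alpha_{k+1})$.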


Example 8.4. If the multinomial probabilities have a prior type-1 Dirichlet distri- 
bution, then derive (1) the unconditional distribution of the multinomial variables; 
(2) the posterior distribution of the multinomial parameters. 


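Solution 8.4. The multinomial probability law for given values of the parameters
$p_1, \ldots, p_{k-1}$ is given by

$$g_1(x_1, \ldots, x_{k-1} \mid p_1, \ldots, p_{k-1}) = \frac{n!}{x_1! \cdots x_k!}\, p_1^{x_1} \cdots p_{k-1}^{x_{k-1}} (1 - p_1 - \cdots - p_{k-1})^{x_k}$$

for $x_1 + \cdots + x_k = n$, and zero otherwise. Let the prior density for $p_1, \ldots, p_{k-1}$ be a type-1
Dirichlet. Let

$$f_2(p_1, \ldots, p_{k-1}) = \frac{\Gamma(\alpha_1 + \cdots + \alpha_k)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_k)}\, p_1^{\alpha_1 - 1} \cdots p_{k-1}^{\alpha_{k-1} - 1} (1 - p_1 - \cdots - p_{k-1})^{\alpha_k - 1}$$

for $0 \le p_i \le 1$, $p_1 + \cdots + p_{k-1} \le 1$, $\alpha_j > 0$, $j = 1, \ldots, k$ and all known, and $f_2(p_1, \ldots, p_{k-1}) = 0$
elsewhere. Then the joint probability function of $x_1, \ldots, x_{k-1}, p_1, \ldots, p_{k-1}$ is given by

$$g_1(x_1, \ldots, x_{k-1} \mid p_1, \ldots, p_{k-1})\, f_2(p_1, \ldots, p_{k-1}).$$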


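(1) The unconditional probability function of $x_1, \ldots, x_{k-1}$, denoted by
$f_1(x_1, \ldots, x_{k-1})$, is available by integrating out $p_1, \ldots, p_{k-1}$.

$$\begin{aligned}
f_1(x_1, \ldots, x_{k-1}) &= \frac{n!}{x_1! \cdots x_k!}\, \frac{\Gamma(\alpha_1 + \cdots + \alpha_k)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_k)} \int_\Omega p_1^{\alpha_1 + x_1 - 1} \cdots p_{k-1}^{\alpha_{k-1} + x_{k-1} - 1} \\
&\quad \times (1 - p_1 - \cdots - p_{k-1})^{n - x_1 - \cdots - x_{k-1} + \alpha_k - 1}\, dp_1 \wedge \cdots \wedge dp_{k-1} \\
&= \frac{n!}{x_1! \cdots x_k!}\, \frac{\Gamma(\alpha_1 + \cdots + \alpha_k)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_k)} \\
&\quad \times \frac{\Gamma(\alpha_1 + x_1) \cdots \Gamma(\alpha_{k-1} + x_{k-1})\, \Gamma(n - x_1 - \cdots - x_{k-1} + \alpha_k)}{\Gamma(\alpha_1 + \cdots + \alpha_k + n)}.
\end{aligned}$$

(2) The posterior density of $p_1, \ldots, p_{k-1}$ is available by dividing the joint
probability function by $f_1(x_1, \ldots, x_{k-1})$. Denoting the posterior density of $p_1, \ldots, p_{k-1}$, given
$x_1, \ldots, x_{k-1}$, by $g_2(p_1, \ldots, p_{k-1} \mid x_1, \ldots, x_{k-1})$ we have

$$\begin{aligned}
g_2(p_1, \ldots, p_{k-1} \mid x_1, \ldots, x_{k-1}) &= \frac{\Gamma(\alpha_1 + \cdots + \alpha_k + n)}{\Gamma(\alpha_1 + x_1) \cdots \Gamma(\alpha_{k-1} + x_{k-1})\, \Gamma(n - x_1 - \cdots - x_{k-1} + \alpha_k)} \\
&\quad \times p_1^{\alpha_1 + x_1 - 1} \cdots p_{k-1}^{\alpha_{k-1} + x_{k-1} - 1} (1 - p_1 - \cdots - p_{k-1})^{n - x_1 - \cdots - x_{k-1} + \alpha_k - 1}
\end{aligned}$$

for $(p_1, \ldots, p_{k-1}) \in \Omega$, $\Re(\alpha_j) > 0$, $j = 1, \ldots, k$, $x_j = 0, 1, \ldots, n$, $j = 1, \ldots, k - 1$ and
$g_2(p_1, \ldots, p_{k-1} \mid x_1, \ldots, x_{k-1}) = 0$ elsewhere. These density functions (1) and (2) are very important in
Bayesian analysis and Bayesian statistical inference.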
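Both parts of Example 8.4 lend themselves to a small numerical illustration. The sketch below is written under the assumption that the unconditional probability function has the gamma-ratio form obtained by integrating the multinomial law against a type-1 Dirichlet prior with parameters $(\alpha_1, \ldots, \alpha_k)$; the function names are hypothetical. It evaluates that probability function with log-gamma functions, checks that it sums to 1 over all admissible counts, and returns the parameters of the posterior, which is again type-1 Dirichlet with $\alpha_j$ replaced by $\alpha_j + x_j$.

```python
from math import exp, lgamma

def compositions(n, k):
    """Yield every tuple of k non-negative integers that sums to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

def unconditional_pmf(xs, alphas):
    """Unconditional probability of the counts xs = (x_1, ..., x_k).

    The last count x_k = n - x_1 - ... - x_{k-1} is passed explicitly.
    """
    n = sum(xs)
    log_p = lgamma(n + 1) - sum(lgamma(x + 1) for x in xs)
    log_p += lgamma(sum(alphas)) - sum(lgamma(a) for a in alphas)
    log_p += sum(lgamma(a + x) for a, x in zip(alphas, xs))
    log_p -= lgamma(sum(alphas) + n)
    return exp(log_p)

def posterior_parameters(xs, alphas):
    """Parameters of the type-1 Dirichlet posterior given the counts xs."""
    return tuple(a + x for a, x in zip(alphas, xs))

if __name__ == "__main__":
    alphas = (1.5, 2.0, 0.5)
    print(sum(unconditional_pmf(xs, alphas) for xs in compositions(5, 3)))
    print(posterior_parameters((2, 2, 1), alphas))
```

With all $\alpha_j = 1$ (a uniform prior) every admissible count vector gets the same unconditional probability, which is a convenient sanity check.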


8.3.2 Type-2 Dirichlet density 


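As an extension of type-2 beta density, we have the type-2 Dirichlet density.

$$f(x_1, \ldots, x_k) = \frac{1}{D_k}\, x_1^{\alpha_1 - 1} \cdots x_k^{\alpha_k - 1} (1 + x_1 + \cdots + x_k)^{-(\alpha_1 + \cdots + \alpha_{k+1})}$$

for $\Re(\alpha_j) > 0$, $j = 1, \ldots, k + 1$, $x_j \ge 0$, $j = 1, \ldots, k$, and $f(x_1, \ldots, x_k) = 0$ elsewhere, where
$D_k = D(\alpha_1, \ldots, \alpha_k; \alpha_{k+1})$ is the same constant as in (8.11). Going
through the same steps as in type-1 Dirichlet case, we can show that

$$\begin{aligned}
\int_0^\infty \cdots \int_0^\infty x_1^{\alpha_1 - 1} \cdots x_k^{\alpha_k - 1} (1 + x_1 + \cdots + x_k)^{-(\alpha_1 + \cdots + \alpha_{k+1})}\, dx_1 \wedge \cdots \wedge dx_k
&= D(\alpha_1, \ldots, \alpha_k; \alpha_{k+1}) \\
&= \frac{\Gamma(\alpha_1) \cdots \Gamma(\alpha_{k+1})}{\Gamma(\alpha_1 + \cdots + \alpha_{k+1})}.
\end{aligned} \tag{8.14}$$

This integral is known as type-2 Dirichlet integral.
Arbitrary product moment, $E[x_1^{h_1} \cdots x_k^{h_k}]$, is available from the type-2 Dirichlet
integral by replacing $\alpha_j$ by $\alpha_j + h_j$, $j = 1, \ldots, k$ and $\alpha_{k+1}$ by $\alpha_{k+1} - h_1 - \cdots - h_k$. That is,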


$$E[x_1^{h_1} \cdots x_k^{h_k}] = \frac{\Gamma(\alpha_1 + h_1)}{\Gamma(\alpha_1)} \cdots \frac{\Gamma(\alpha_k + h_k)}{\Gamma(\alpha_k)}\, \frac{\Gamma(\alpha_{k+1} - h_1 - \cdots - h_k)}{\Gamma(\alpha_{k+1})} \tag{8.15}$$

for $\Re(\alpha_j + h_j) > 0$, $j = 1, \ldots, k$, $\Re(\alpha_{k+1} - h_1 - \cdots - h_k) > 0$. This means that the product
moment can exist only for some values of $h_1, \ldots, h_k$. Since arbitrary moments uniquely
determine the distribution in this case, from (8.14) and (8.15) it is clear that if $x_1, \ldots, x_k$ are
jointly type-2 Dirichlet distributed then any subset therein will also be type-2 Dirichlet
distributed.


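Result 8.3. If $x_1, \ldots, x_k$ are jointly type-2 Dirichlet distributed, then all subsets of
$x_1, \ldots, x_k$ are type-2 Dirichlet distributed and individual variables are type-2 beta
distributed.

Result 8.4. If $x_1, \ldots, x_k$ are jointly type-2 Dirichlet distributed, then $y_1 = \frac{1}{1 + x_1 + \cdots + x_k}$
and $y_2 = \frac{x_1 + \cdots + x_k}{1 + x_1 + \cdots + x_k}$ are type-1 beta distributed.

For proving this result, let us take the h-th moment of $\frac{1}{1 + x_1 + \cdots + x_k}$ for arbitrary h.
That is,

$$\begin{aligned}
E\left[\frac{1}{1 + x_1 + \cdots + x_k}\right]^h &= E[1 + x_1 + \cdots + x_k]^{-h} \\
&= [D(\alpha_1, \ldots, \alpha_k; \alpha_{k+1})]^{-1} \int_0^\infty \cdots \int_0^\infty x_1^{\alpha_1 - 1} \cdots x_k^{\alpha_k - 1} \\
&\quad \times (1 + x_1 + \cdots + x_k)^{-(\alpha_1 + \cdots + \alpha_{k+1} + h)}\, dx_1 \wedge \cdots \wedge dx_k \\
&= [D(\alpha_1, \ldots, \alpha_k; \alpha_{k+1})]^{-1}\, \frac{\Gamma(\alpha_1) \cdots \Gamma(\alpha_k)\, \Gamma(\alpha_{k+1} + h)}{\Gamma(\alpha_1 + \cdots + \alpha_{k+1} + h)} \\
&= \frac{\Gamma(\alpha_{k+1} + h)}{\Gamma(\alpha_{k+1})}\, \frac{\Gamma(\alpha_1 + \cdots + \alpha_{k+1})}{\Gamma(\alpha_1 + \cdots + \alpha_{k+1} + h)}
\end{aligned}$$

which is the h-th moment of a type-1 beta random variable with the parameters
$(\alpha_{k+1}, \alpha_1 + \cdots + \alpha_k)$. Hence the result. The second part follows by observing that
$y_2 = 1 - y_1$. Here, the parameters are $(\alpha_1 + \cdots + \alpha_k, \alpha_{k+1})$.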
There are various generalizations of type-1 and type-2 Dirichlet densities. Two 
forms which appear in reliability analysis and life-testing models are the following: 


04-1 05-1 
fiio seuy (Lx) uo 


x (1 --χι -xj «x1 


xoa as aoe, ds. usen (8.16) 
04-1, 05-1 05-1 

Pas ο = Eq x2 (qd)? 
x (x, + Xp & x3) «xd 


x(-x----xjfMerl (Q..,x)e€0 (8.17) 


where $c_1$ and $c_2$ are the normalizing constants. For evaluating the normalizing
constants in (8.16) and (8.17), start with integrating from $x_k, x_{k-1}$ to $x_1$. In (8.17), make
the substitution $u_1 = x_1$, $u_2 = x_1 + x_2$ etc. One generalization of type-2 Dirichlet is the
following:

BGs = i Me tx) δικο... 


x (1g 4 Εχει) Peor 


Xx (LEX + Ey) tt Bid) (8.18) 
for $0 \le x_i < \infty$, $i = 1, \ldots, k$. For evaluating the normalizing constant $c_3$ start
integrating from $x_k, x_{k-1}, \ldots, x_1$. These computations and evaluations of the corresponding
marginal densities are left as exercises to the students.

Before concluding this section, let us look into the meaning of largest and smallest 
random variables. 


Example 8.5. Let $x_1, x_2, x_3$ be independently distributed exponential random variables
with mean values $\lambda_1^{-1}, \lambda_2^{-1}, \lambda_3^{-1}$, respectively. Let $y_1 = \max\{x_1, x_2, x_3\}$ and
$y_2 = \min\{x_1, x_2, x_3\}$. Evaluate the densities of $y_1$ and $y_2$.


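Solution 8.5. The student may be confused about the meaning of largest of a set of
random variables when $x_1, x_2, x_3$ are all defined on [0, ∞). Let one set of observations
on $\{x_1, x_2, x_3\}$ be {2, 8, 5}, another set be {10, 7.2, 1}, yet another set be {2, 4.2, 8.5}. The
largest of these observations from each set are {8, 10, 8.5} and the smallest are {2, 1, 2}.
If {8, 10, 8.5, ...} are observations on some random variable $y_1$, then $y_1$ is called the largest
of $x_1, x_2, x_3$ or $y_1 = \max\{x_1, x_2, x_3\}$. Similarly, if {2, 1, 2, ...} are observations on some
random variable $y_2$ then $y_2 = \min\{x_1, x_2, x_3\}$. Let the densities and distribution functions
of these be denoted by $f_{y_1}(y_1)$, $f_{y_2}(y_2)$, $F_{y_1}(y_1)$, $F_{y_2}(y_2)$. If the largest $y_1$ is less than a
number u, then all $x_1, x_2, x_3$ must be less than u. Similarly, if the smallest one $y_2$ is greater
than v, then all must be greater than v. But $F_{y_1}(u) = \Pr\{y_1 \le u\}$ and $f_{y_1}(u) = \frac{d}{du} F_{y_1}(u)$.
Similarly, $1 - F_{y_2}(v) = \Pr\{y_2 > v\}$. But due to independence,

$$\begin{aligned}
F_{y_1}(u) &= \Pr\{x_1 \le u\} \Pr\{x_2 \le u\} \Pr\{x_3 \le u\} \\
&= \prod_{j=1}^{3} \int_0^u \lambda_j e^{-\lambda_j x_j}\, dx_j = \prod_{j=1}^{3} [1 - e^{-\lambda_j u}]
\end{aligned}$$

$$\begin{aligned}
f_{y_1}(u) = \frac{d}{du} F_{y_1}(u) &= \sum_{j=1}^{3} \lambda_j e^{-\lambda_j u} - (\lambda_1 + \lambda_2) e^{-(\lambda_1 + \lambda_2) u} \\
&\quad - (\lambda_1 + \lambda_3) e^{-(\lambda_1 + \lambda_3) u} - (\lambda_2 + \lambda_3) e^{-(\lambda_2 + \lambda_3) u} \\
&\quad + (\lambda_1 + \lambda_2 + \lambda_3) e^{-(\lambda_1 + \lambda_2 + \lambda_3) u}
\end{aligned}$$

for $0 \le u < \infty$ and zero otherwise. Similarly,

$$\begin{aligned}
f_{y_2}(v) &= -\frac{d}{dv} \Pr\{y_2 > v\} = -\frac{d}{dv} \prod_{j=1}^{3} \int_v^\infty \lambda_j e^{-\lambda_j x_j}\, dx_j \\
&= -\frac{d}{dv} \big(e^{-(\lambda_1 + \lambda_2 + \lambda_3) v}\big) = (\lambda_1 + \lambda_2 + \lambda_3) e^{-(\lambda_1 + \lambda_2 + \lambda_3) v}, \quad 0 \le v < \infty
\end{aligned}$$

and zero elsewhere.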
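The densities of the largest and smallest of independent exponential variables can be checked against their distribution functions. The following sketch is illustrative (the function names and the chosen rates are assumptions): it codes the distribution function of the maximum as a product, its density by the product rule, the expanded form for three variables, and the exponential density of the minimum with rate $\lambda_1 + \lambda_2 + \lambda_3$.

```python
from math import exp

def cdf_max(u, lams):
    """Pr{max <= u} = product of the individual exponential distribution functions."""
    prob = 1.0
    for lam in lams:
        prob *= 1.0 - exp(-lam * u)
    return prob

def pdf_max(u, lams):
    """Derivative of cdf_max by the product rule."""
    total = 0.0
    for j, lam in enumerate(lams):
        term = lam * exp(-lam * u)
        for i, other in enumerate(lams):
            if i != j:
                term *= 1.0 - exp(-other * u)
        total += term
    return total

def pdf_max_expanded(u, l1, l2, l3):
    """Expanded form of the density of the maximum for three variables."""
    return (l1 * exp(-l1 * u) + l2 * exp(-l2 * u) + l3 * exp(-l3 * u)
            - (l1 + l2) * exp(-(l1 + l2) * u)
            - (l1 + l3) * exp(-(l1 + l3) * u)
            - (l2 + l3) * exp(-(l2 + l3) * u)
            + (l1 + l2 + l3) * exp(-(l1 + l2 + l3) * u))

def survival_min(v, lams):
    """Pr{min > v}: all variables exceed v."""
    return exp(-sum(lams) * v)

def pdf_min(v, lams):
    """Density of the minimum: exponential with rate sum(lams)."""
    rate = sum(lams)
    return rate * exp(-rate * v)

if __name__ == "__main__":
    lams = (0.5, 1.0, 2.0)
    print(pdf_max(0.8, lams), pdf_max_expanded(0.8, *lams))
    print(pdf_min(0.3, lams))
```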


Exercises 8.3 


8.3.1. Evaluate the normalizing constant $c_1$ in (8.16). Then evaluate the joint marginal
densities of $(x_1, \ldots, x_{k-1})$, $(x_1, \ldots, x_{k-2})$, ..., $x_1$.

8.3.2. For the model in (8.17) evaluate $E[x_1^{h_1} \cdots x_k^{h_k}]$.

8.3.3. By using Exercise 8.3.2, or otherwise, show that $x_1$ in the model (8.17) can be
written equivalently as a product of independently distributed type-1 beta random
variables. (Hint: Take $E(x_1^h)$ and look at the decomposition of this gamma product.)

8.3.4. Evaluate the normalizing constant $c_2$ in (8.17).

8.3.5. Evaluate the normalizing constant $c_3$ in (8.18).


8.3.6. Take the sum $u = x_1 + \cdots + x_k$, the sum of type-1 Dirichlet variables. In Result 8.2,
it is shown that u is a type-1 beta variable. By using the fact that if u is type-1 beta, then
$\frac{u}{1 - u}$ and $\frac{1 - u}{u}$ are type-2 beta variables, write down the results on $(x_1, \ldots, x_k)$.


8.3.7. It is shown in Result 8.4 that $u = \frac{x_1 + \cdots + x_k}{1 + x_1 + \cdots + x_k}$ is a type-1 beta if $(x_1, \ldots, x_k)$ has a
type-2 Dirichlet distribution. Using the fact that if u is type-1 beta, then $\frac{u}{1 - u}$ and $\frac{1 - u}{u}$ are
type-2 beta distributed, write down the corresponding results for $(x_1, \ldots, x_k)$.


8.3.8. Using Exercises 8.3.6 and 8.3.7 and by using the properties that if w is type-2
beta, then $\frac{1}{w}$ is type-2 beta, $\frac{1}{1 + w}$ is type-1 beta, $\frac{w}{1 + w}$ is type-1 beta, write down the
corresponding results on $(x_1, \ldots, x_k)$ when $x_1, \ldots, x_k$ have a type-2 Dirichlet distribution.


8.3.9. If $(x_1, \ldots, x_k)$ is type-1 Dirichlet, then evaluate the conditional density of $x_1$ given
$x_2, \ldots, x_k$.


8.3.10. For k = 2, consider type-1 and type-2 Dirichlet densities. By using Maple or
Mathematica, draw the 3-dimensional surfaces for (1) fixed $\alpha_1, \alpha_2$ and varying $\alpha_3$;
(2) fixed $\alpha_2, \alpha_3$ and varying $\alpha_1$.


8.4 Multivariate normal or Gaussian density 


As discussed earlier, there is nothing called a unique multivariate analogue of a
univariate normal density. But the following form is used as a multivariate analogue due
to many parallel characterization results and also due to mathematical convenience.
Let $X' = (x_1, \ldots, x_p)$ be the transpose of the column vector X with elements $x_1, \ldots, x_p$.
Consider the following real-valued scalar function (1 × 1 matrix) of X, denoted by f(X):

$$f(X) = c\, e^{-\frac{1}{2}(X - \tilde{\mu})' \Sigma^{-1} (X - \tilde{\mu})} \tag{8.19}$$

for $-\infty < x_j < \infty$, $-\infty < \mu_j < \infty$, $j = 1, \ldots, p$, $\tilde{\mu}' = (\mu_1, \ldots, \mu_p)$, $\Sigma = \Sigma' > O$ (positive definite
p × p matrix), where $\tilde{\mu}$ is a parameter vector, Σ > O is a parameter matrix and c is the
normalizing constant. Parallel to the notation used for the scalar case, we will use the
following notation.

Notation 8.1 (p-variate Gaussian).

$$X \sim N_p(\tilde{\mu}, \Sigma), \quad \Sigma > O \tag{8.20}$$

meaning that the p × 1 vector X is normal or Gaussian distributed as p-variate normal
with parameters $\tilde{\mu}$ and Σ with Σ > O (positive definite).


In order to study properties of p-variate Gaussian as well as generalizations of 
p-variate Gaussian and other generalized densities, a few results on Jacobians will be 
useful. These will be listed here as a note. Those who are familiar with these may skip 
the note and go to the text. 


Note 8.1 (A note on Jacobians). Before starting the discussion of Jacobians, some 
basic notations from differential calculus will be recalled here. 


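Notation 8.2. ∧: wedge product.


Definition 8.1 (Wedge product or skew symmetric product). The wedge product or
skew symmetric product of differentials is defined as follows:

$$dx \wedge dy = -dy \wedge dx \ \Rightarrow\ dx \wedge dx = 0,\quad dy \wedge dy = 0.$$

Now let $y_1 = f_1(x_1, x_2)$, $y_2 = f_2(x_1, x_2)$ be two functions of the real scalar variables $x_1$
and $x_2$. From differential calculus,

$$dy_1 = \frac{\partial f_1}{\partial x_1} dx_1 + \frac{\partial f_1}{\partial x_2} dx_2 \tag{i}$$

$$dy_2 = \frac{\partial f_2}{\partial x_1} dx_1 + \frac{\partial f_2}{\partial x_2} dx_2 \tag{ii}$$

where $\frac{\partial f_i}{\partial x_j}$ denotes the partial derivative of $f_i$ with respect to $x_j$. Then

$$\begin{aligned}
dy_1 \wedge dy_2 &= \left[\frac{\partial f_1}{\partial x_1} dx_1 + \frac{\partial f_1}{\partial x_2} dx_2\right] \wedge \left[\frac{\partial f_2}{\partial x_1} dx_1 + \frac{\partial f_2}{\partial x_2} dx_2\right] \\
&= \frac{\partial f_1}{\partial x_1} \frac{\partial f_2}{\partial x_1}\, dx_1 \wedge dx_1 + \frac{\partial f_1}{\partial x_1} \frac{\partial f_2}{\partial x_2}\, dx_1 \wedge dx_2 \\
&\quad + \frac{\partial f_1}{\partial x_2} \frac{\partial f_2}{\partial x_1}\, dx_2 \wedge dx_1 + \frac{\partial f_1}{\partial x_2} \frac{\partial f_2}{\partial x_2}\, dx_2 \wedge dx_2 \\
&= 0 + \left[\frac{\partial f_1}{\partial x_1} \frac{\partial f_2}{\partial x_2} - \frac{\partial f_1}{\partial x_2} \frac{\partial f_2}{\partial x_1}\right] dx_1 \wedge dx_2 + 0 \\
&= \begin{vmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} \end{vmatrix}\, dx_1 \wedge dx_2 = J\, dx_1 \wedge dx_2
\end{aligned}$$

where J is the Jacobian, the determinant of the matrix of partial derivatives $\big(\frac{\partial f_i}{\partial x_j}\big)$. In general, if
$y_j = f_j(x_1, \ldots, x_p)$, $j = 1, \ldots, p$ and the matrix of partial derivatives is $\big(\frac{\partial f_i}{\partial x_j}\big)$ then the Jacobian
is the determinant

$$J = \left|\left(\frac{\partial f_i}{\partial x_j}\right)\right|. \tag{8.21}$$

When $J \ne 0$, then we have

$$\begin{aligned}
dy_1 \wedge \cdots \wedge dy_p &= J\, dx_1 \wedge \cdots \wedge dx_p, \\
dx_1 \wedge \cdots \wedge dx_p &= \frac{1}{J}\, dy_1 \wedge \cdots \wedge dy_p.
\end{aligned} \tag{8.22}$$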

As an application of (8.22) we will evaluate a few Jacobians. 


Result 8.5. Let $x_1, \ldots, x_p$ be distinct real scalar variables and the $a_{ij}$'s be constants.
Consider the linear forms:

$$\begin{aligned}
y_1 &= a_{11} x_1 + a_{12} x_2 + \cdots + a_{1p} x_p \\
y_2 &= a_{21} x_1 + a_{22} x_2 + \cdots + a_{2p} x_p \\
&\ \ \vdots \\
y_p &= a_{p1} x_1 + a_{p2} x_2 + \cdots + a_{pp} x_p.
\end{aligned}$$

We may write this as Y = AX, $Y' = (y_1, \ldots, y_p)$, $X' = (x_1, \ldots, x_p)$, $A = (a_{ij})$. Then

$$Y = AX,\ |A| \ne 0 \ \Rightarrow\ dY = |A|\, dX$$

where $dY = dy_1 \wedge dy_2 \wedge \cdots \wedge dy_p$ and $dX = dx_1 \wedge \cdots \wedge dx_p$.


The proof is trivial: $\frac{\partial y_i}{\partial x_j} = a_{ij}$, and the Jacobian is the determinant or J = |A|. When
|A| ≠ 0, then the transformation is one to one.

$$Y = AX,\ |A| \ne 0 \ \Rightarrow\ X = A^{-1} Y.$$

We may generalize Result 8.5 for more general linear transformations. Consider an m × n
matrix X of distinct or functionally independent real scalar variables. Here, the wedge
product in X will be of the form:

$$dX = dx_{11} \wedge \cdots \wedge dx_{1n} \wedge dx_{21} \wedge \cdots \wedge dx_{2n} \wedge \cdots \wedge dx_{mn}.$$

Let A be an m × m non-singular matrix of constants. Then we have the following result.


Result 8.6. For the m × n matrix X of distinct real scalar variables and A an m × m
non-singular matrix of constants,

$$Y = AX,\ |A| \ne 0 \ \Rightarrow\ dY = |A|^n\, dX.$$

Proof. Let $Y_{(1)}, Y_{(2)}, \ldots, Y_{(n)}$ be the columns of Y, $X_{(1)}, \ldots, X_{(n)}$ be the columns of X.
Then

$$Y = AX \ \Rightarrow\ (Y_{(1)}, \ldots, Y_{(n)}) = (A X_{(1)}, \ldots, A X_{(n)}).$$

Now if we look at the string of variables in $Y_{(1)}$, then in $Y_{(2)}$, ..., in $Y_{(n)}$ and the
corresponding variables in $X_{(1)}, \ldots, X_{(n)}$, then the matrix of partial derivatives will be a
diagonal block matrix of the form:

$$\begin{pmatrix}
A & O & \cdots & O \\
O & A & \cdots & O \\
\vdots & \vdots & \ddots & \vdots \\
O & O & \cdots & A
\end{pmatrix}$$

where the diagonal blocks are A each and there are n such A's, and hence the
determinant is $|A|^n$. This establishes the result.


Now let us see what happens if we post-multiply X with a non-singular n x n ma- 
trix B. This will be stated as the next result. 


Result 8.7. Let X be an m × n matrix of distinct real scalar variables and let B be an n × n
non-singular matrix of constants. Then

$$Y = XB,\ |B| \ne 0 \ \Rightarrow\ dY = |B|^m\, dX.$$

For proving this result, consider the rows of X and Y and proceed as in the case of
Result 8.6. Now, combining Results 8.6 and 8.7 we have the following result.


Result 8.8. Let Y and X be m × n matrices of real distinct variables. Let A be m × m
and B be n × n non-singular matrices of constants. Then

$$Y = AXB,\ |A| \ne 0,\ |B| \ne 0 \ \Rightarrow\ dY = |A|^n |B|^m\, dX.$$

For proving this result, use Results 8.6 and 8.7. Put Z = XB and Y = AZ, compute
dY in terms of dZ and dZ in terms of dX to prove the result.
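Results 8.6 to 8.8 can be verified on small matrices by building the full matrix of partial derivatives explicitly. The sketch below is an illustration under stated assumptions: the helper names `det` and `jacobian_of_AXB` are hypothetical, the variables of X and Y are ordered row by row, and exact fractions are used so that the determinants can be compared exactly. For Y = AXB we have $y_{ij} = \sum_k \sum_l a_{ik} x_{kl} b_{lj}$, so that $\partial y_{ij} / \partial x_{kl} = a_{ik} b_{lj}$.

```python
from fractions import Fraction

def det(matrix):
    """Determinant by exact Gaussian elimination on a copy of the matrix."""
    a = [[Fraction(v) for v in row] for row in matrix]
    size = len(a)
    result = Fraction(1)
    for col in range(size):
        pivot = next((r for r in range(col, size) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result  # a row swap changes the sign
        result *= a[col][col]
        for r in range(col + 1, size):
            factor = a[r][col] / a[col][col]
            for c in range(col, size):
                a[r][c] -= factor * a[col][c]
    return result

def jacobian_of_AXB(A, B):
    """Matrix of partial derivatives of y_ij with respect to x_kl for Y = A X B.

    Rows are indexed by (i, j) and columns by (k, l), both taken row by row.
    """
    m, n = len(A), len(B)
    return [[A[i][k] * B[l][j] for k in range(m) for l in range(n)]
            for i in range(m) for j in range(n)]

if __name__ == "__main__":
    A = [[2, 1], [1, 3]]                    # m = 2
    B = [[1, 2, 0], [0, 1, 1], [1, 0, 2]]   # n = 3
    print(det(jacobian_of_AXB(A, B)), det(A) ** 3 * det(B) ** 2)
```

Taking B as the identity matrix reproduces Result 8.6 and taking A as the identity matrix reproduces Result 8.7.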

Now we shall consider a linear transformation on a symmetric matrix. When a
p × p matrix $X = (x_{ij})$ of real scalar variables is symmetric then we have only p(p + 1)/2
distinct real scalar variables. Then the product of differentials will be of the form:

$$dX = dx_{11} \wedge \cdots \wedge dx_{1p} \wedge dx_{22} \wedge \cdots \wedge dx_{2p} \wedge \cdots \wedge dx_{pp}.$$

Result 8.9. Let X, A be p × p, $X = X' = (x_{ij})$ be a matrix of p(p + 1)/2 distinct real
scalar variables and let A be a non-singular matrix of constants. Then

$$Y = AXA',\ X = X',\ |A| \ne 0 \ \Rightarrow\ dY = |A|^{p+1}\, dX.$$


We can prove the result by using the fact that any non-singular matrix can be 
represented as a product of elementary matrices. For the proof of this result as well as 
those of many other results, the students may see the book [3]. We will list two results 
on non-linear transformations, without proofs, before closing this note. 


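Result 8.10. Let the p × p matrix X be non-singular so that its regular inverse $X^{-1}$
exists. Then, ignoring the sign,

$$Y = X^{-1} \ \Rightarrow\ dY = \begin{cases}
|X|^{-2p}\, dX & \text{for a general } X \\
|X|^{-(p+1)}\, dX & \text{for } X = X' \\
|X|^{-(p-1)}\, dX & \text{for } X' = -X.
\end{cases}$$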


Result 8.11. Let the p × p matrix X be symmetric and let it be positive definite with
p(p + 1)/2 distinct real variables. Let $T = (t_{ij})$ be a lower triangular matrix with positive
diagonal elements, $t_{jj} > 0$, $j = 1, \ldots, p$ and the $t_{ij}$'s, $i > j$ being distinct real variables. Then
the transformation X = TT' is one to one and

$$X = TT',\ t_{jj} > 0,\ j = 1, \ldots, p \ \Rightarrow\ dX = 2^p \Big\{\prod_{j=1}^{p} t_{jj}^{\,p+1-j}\Big\}\, dT.$$


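With the help of the above Jacobians, a number of results can be established. Some
applications to statistical distribution theory will be considered next.
We shall evaluate the normalizing constant in the p-variate normal density. Let

$$Y = \Sigma^{-\frac{1}{2}}(X - \tilde{\mu}) \ \Rightarrow\ dY = |\Sigma|^{-\frac{1}{2}}\, d(X - \tilde{\mu}) = |\Sigma|^{-\frac{1}{2}}\, dX$$

since $\tilde{\mu}$ is a constant vector, where $\Sigma^{-\frac{1}{2}}$ is the positive definite square root of $\Sigma^{-1} > O$.
Now, we shall evaluate the normalizing constant c. We use the standard notation $\int_X$
which means the integral over X. Then

$$\begin{aligned}
1 = \int_X f(X)\, dX &= c \int_X e^{-\frac{1}{2}(X - \tilde{\mu})' \Sigma^{-1} (X - \tilde{\mu})}\, dX \\
&= c\, |\Sigma|^{\frac{1}{2}} \int_Y e^{-\frac{1}{2} Y'Y}\, dY, \quad Y = \Sigma^{-\frac{1}{2}}(X - \tilde{\mu})
\end{aligned}$$

because, under the substitution $Y = \Sigma^{-\frac{1}{2}}(X - \tilde{\mu})$,

$$(X - \tilde{\mu})' \Sigma^{-1} (X - \tilde{\mu}) = (X - \tilde{\mu})' \Sigma^{-\frac{1}{2}} \Sigma^{-\frac{1}{2}} (X - \tilde{\mu}) = Y'Y = y_1^2 + \cdots + y_p^2, \quad Y' = (y_1, \ldots, y_p).$$

Here, $\int_Y$ means the multiple integral $\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty}$. But

$$\int_{-\infty}^{\infty} e^{-\frac{1}{2} z^2}\, dz = 2 \int_0^{\infty} e^{-\frac{1}{2} z^2}\, dz$$

due to even and convergent functions, and it is

$$= \sqrt{2} \int_0^{\infty} u^{\frac{1}{2} - 1} e^{-u}\, du \quad \Big(\text{put } u = \frac{z^2}{2}\Big) = \sqrt{2}\, \Gamma\Big(\frac{1}{2}\Big) = \sqrt{2\pi} \ \Rightarrow$$

$$\int_Y e^{-\frac{1}{2} Y'Y}\, dY = (2\pi)^{\frac{p}{2}}.$$

Hence

$$c = \frac{1}{|\Sigma|^{\frac{1}{2}} (2\pi)^{\frac{p}{2}}}.$$

Therefore, the p-variate normal density is given by

$$f(X) = \frac{1}{|\Sigma|^{\frac{1}{2}} (2\pi)^{\frac{p}{2}}}\, e^{-\frac{1}{2}(X - \tilde{\mu})' \Sigma^{-1} (X - \tilde{\mu})}, \quad \Sigma > O.$$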

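What is the mean value of X, that is, E(X), and what is the covariance matrix of X, that
is, Cov(X)? By definition,

$$\begin{aligned}
E(X) = \int_X X f(X)\, dX &= \int_X (X - \tilde{\mu} + \tilde{\mu}) f(X)\, dX \\
&= \tilde{\mu} \int_X f(X)\, dX + \int_X (X - \tilde{\mu}) f(X)\, dX \\
&= \tilde{\mu} + \int_X (X - \tilde{\mu}) f(X)\, dX
\end{aligned}$$

since the total integral is 1. Make the same substitution,

$$Y = \Sigma^{-\frac{1}{2}}(X - \tilde{\mu}) \ \Rightarrow\ dY = |\Sigma|^{-\frac{1}{2}}\, d(X - \tilde{\mu}) = |\Sigma|^{-\frac{1}{2}}\, dX.$$

Then

$$\int_X (X - \tilde{\mu}) f(X)\, dX = \frac{\Sigma^{\frac{1}{2}}}{(2\pi)^{\frac{p}{2}}} \int_Y Y e^{-\frac{1}{2} Y'Y}\, dY$$

where the integrand is an odd function in the elements of Y and each piece is convergent,
and hence the integral is zero. Therefore,

$$E(X) = \tilde{\mu}$$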


or the first parameter vector is the mean value of X itself. The covariance matrix of X, 
by definition, 


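$$\mathrm{Cov}(X) = E\{[X - E(X)][X - E(X)]'\} = E[(X - \tilde{\mu})(X - \tilde{\mu})'] = \Sigma^{\frac{1}{2}}E(YY')\Sigma^{\frac{1}{2}},$$

under the substitution $Y = \Sigma^{-\frac{1}{2}}(X - \tilde{\mu})$. That is,

$$\mathrm{Cov}(X) = \Sigma^{\frac{1}{2}}\Big\{\frac{1}{(2\pi)^{\frac{p}{2}}}\int_Y YY' e^{-\frac{1}{2}Y'Y}\,dY\Big\}\Sigma^{\frac{1}{2}}.$$

But

$$YY' = \begin{pmatrix} y_1 \\ \vdots \\ y_p \end{pmatrix}(y_1,\ldots,y_p) = \begin{pmatrix} y_1^2 & y_1y_2 & \cdots & y_1y_p \\ \vdots & \vdots & & \vdots \\ y_py_1 & y_py_2 & \cdots & y_p^2 \end{pmatrix}.$$

The non-diagonal elements are all odd functions, and hence the integrals over all non-diagonal elements will be zeros. A diagonal element is of the form:

$$a_{jj} = \frac{1}{(2\pi)^{\frac{p}{2}}}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} y_j^2\,e^{-\frac{1}{2}(y_1^2+\cdots+y_p^2)}\,dy_1\wedge\cdots\wedge dy_p$$

$$= \Big[\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} y_j^2 e^{-\frac{1}{2}y_j^2}\,dy_j\Big]\prod_{i\ne j}\Big[\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-\frac{1}{2}y_i^2}\,dy_i\Big]$$

$$= \Big[\frac{2}{\sqrt{2\pi}}\int_0^{\infty} y_j^2 e^{-\frac{1}{2}y_j^2}\,dy_j\Big]\prod_{i\ne j}\Big[\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-\frac{1}{2}y_i^2}\,dy_i\Big].$$

But, under the substitution $t = \frac{1}{2}y^2$ in each integral,

$$\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-\frac{1}{2}y_i^2}\,dy_i = \frac{\sqrt{2}}{\sqrt{2\pi}}\int_0^{\infty} t^{\frac{1}{2}-1}e^{-t}\,dt = \frac{1}{\sqrt{\pi}}\,\Gamma\Big(\frac{1}{2}\Big) = \frac{\sqrt{\pi}}{\sqrt{\pi}} = 1$$

and

$$\frac{2}{\sqrt{2\pi}}\int_0^{\infty} y_j^2 e^{-\frac{1}{2}y_j^2}\,dy_j = \frac{2}{\sqrt{\pi}}\int_0^{\infty} t^{\frac{3}{2}-1}e^{-t}\,dt = \frac{2}{\sqrt{\pi}}\,\Gamma\Big(\frac{3}{2}\Big) = 1.$$

Thus each diagonal element gives 1, and thus the integral over $Y$ gives an identity matrix and, therefore,

$$\mathrm{Cov}(X) = \Sigma$$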


which is the parameter matrix in the density. Thus the two parameters in the density 
are the mean value and the covariance matrix. 
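The interpretation of the two parameters can be checked by simulation. The sketch below draws bivariate normal observations and compares the sample mean and sample covariance with the parameters used to generate them. It uses a lower triangular (Cholesky) factor $L$ with $LL' = \Sigma$ in place of the symmetric square root used in the text; either factor produces the same distribution. All names and parameter values are assumptions made for illustration.

```python
import math
import random


def sample_bivariate_normal(mu, sigma, rng):
    """One draw X = mu + L Z, where L L' = sigma and Z has iid N(0, 1) components."""
    (s11, s12), (_, s22) = sigma
    l11 = math.sqrt(s11)
    l21 = s12 / l11
    l22 = math.sqrt(s22 - l21 * l21)
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return mu[0] + l11 * z1, mu[1] + l21 * z1 + l22 * z2


def sample_mean_and_cov(mu, sigma, n, seed=0):
    """Sample mean vector and sample covariance matrix (divisor n) of n draws."""
    rng = random.Random(seed)
    xs = [sample_bivariate_normal(mu, sigma, rng) for _ in range(n)]
    m1 = sum(x[0] for x in xs) / n
    m2 = sum(x[1] for x in xs) / n
    c11 = sum((x[0] - m1) ** 2 for x in xs) / n
    c22 = sum((x[1] - m2) ** 2 for x in xs) / n
    c12 = sum((x[0] - m1) * (x[1] - m2) for x in xs) / n
    return (m1, m2), ((c11, c12), (c12, c22))


if __name__ == "__main__":
    # Hypothetical parameters; the output should be close to them.
    print(sample_mean_and_cov((1.0, -2.0), ((2.0, 0.6), (0.6, 1.0)), 50_000))
```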

Let us evaluate the moment generating function. Let $T' = (t_1,\ldots,t_p)$ be the vector of parameters. Then by definition the moment generating function

$$M(t_1,\ldots,t_p) = M_X(T) = E[e^{t_1x_1+\cdots+t_px_p}] = E[e^{T'X}]$$

$$= E[e^{T'(X-\tilde{\mu}+\tilde{\mu})}] = e^{T'\tilde{\mu}}E[e^{T'(X-\tilde{\mu})}]$$

$$= e^{T'\tilde{\mu}}E[e^{T'\Sigma^{\frac{1}{2}}Y}] \quad \text{for } Y = \Sigma^{-\frac{1}{2}}(X-\tilde{\mu})$$

$$= e^{T'\tilde{\mu}}\frac{1}{(2\pi)^{\frac{p}{2}}}\int_Y e^{T'\Sigma^{\frac{1}{2}}Y-\frac{1}{2}Y'Y}\,dY.$$

But

$$(T'\Sigma^{\frac{1}{2}})Y - \frac{1}{2}Y'Y = -\frac{1}{2}[Y'Y - 2T'\Sigma^{\frac{1}{2}}Y] = -\frac{1}{2}[(Y-\Sigma^{\frac{1}{2}}T)'(Y-\Sigma^{\frac{1}{2}}T) - T'\Sigma T].$$

Therefore,

$$M_X(T) = e^{T'\tilde{\mu}+\frac{1}{2}T'\Sigma T}\int_Y \frac{1}{(2\pi)^{\frac{p}{2}}}\,e^{-\frac{1}{2}(Y-\Sigma^{\frac{1}{2}}T)'(Y-\Sigma^{\frac{1}{2}}T)}\,dY = e^{T'\tilde{\mu}+\frac{1}{2}T'\Sigma T} \tag{8.23}$$

since the integral is 1. It can be looked upon as the total integral coming from a p-variate normal with the parameters $(\Sigma^{\frac{1}{2}}T, I)$. Thus, for example, for $p = 1$, $\Sigma = \sigma_{11} = \sigma_1^2$, $T' = t_1$ and, therefore,

$$M_X(T) = \exp\Big\{t_1\mu_1 + \frac{1}{2}t_1^2\sigma_1^2\Big\}. \tag{8.24}$$

This equation (8.23) can be taken as the definition for a p-variate normal and then in this case $\Sigma$ can be taken as positive semi-definite also because even for a positive semi-definite matrix $\Sigma$ the right side in (8.23) will exist. Then in that case when $\Sigma$ is singular, or when $|\Sigma| = 0$, we will call the corresponding p-variate normal a singular normal. For a singular normal case, there is no density because when $\Sigma$ on the right side of (8.23) is singular the inverse Laplace transform does not give a density function, and hence a singular normal has no density but all its properties can be studied by using the mgf in (8.23).
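For $p = 2$, $T' = (t_1, t_2)$, $\tilde{\mu}' = (\mu_1, \mu_2)$, $T'\tilde{\mu} = t_1\mu_1 + t_2\mu_2$;

$$\Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{bmatrix} = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_2\sigma_1 & \sigma_2^2 \end{bmatrix},$$

$$T'\Sigma T = [t_1, t_2]\begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_2\sigma_1 & \sigma_2^2 \end{bmatrix}\begin{bmatrix} t_1 \\ t_2 \end{bmatrix} = \sigma_1^2t_1^2 + 2\rho\sigma_1\sigma_2t_1t_2 + \sigma_2^2t_2^2;$$

$$M_X(T) = \exp\Big\{(t_1\mu_1 + t_2\mu_2) + \frac{1}{2}(t_1^2\sigma_1^2 + 2t_1t_2\rho\sigma_1\sigma_2 + t_2^2\sigma_2^2)\Big\} \tag{8.25}$$

where $\rho$ is the correlation between $x_1$ and $x_2$ in this case.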
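The bivariate mgf can be used to recover the moments numerically. The following sketch codes the closed form for p = 2 and differentiates it at the origin by central finite differences, which should return the means and the covariance $\rho\sigma_1\sigma_2$. The function names, the step size and the parameter values are assumptions chosen for illustration.

```python
import math


def mgf_bivariate_normal(t1, t2, mu1, mu2, s1, s2, rho):
    """Closed-form mgf of a bivariate normal with standard deviations s1, s2."""
    linear = t1 * mu1 + t2 * mu2
    quad = t1 * t1 * s1 * s1 + 2.0 * rho * s1 * s2 * t1 * t2 + t2 * t2 * s2 * s2
    return math.exp(linear + 0.5 * quad)


def moments_from_mgf(mu1, mu2, s1, s2, rho, h=1e-3):
    """Approximate E(x1), E(x2) and Cov(x1, x2) from derivatives of the mgf at T = 0."""
    def m(t1, t2):
        return mgf_bivariate_normal(t1, t2, mu1, mu2, s1, s2, rho)

    e1 = (m(h, 0.0) - m(-h, 0.0)) / (2.0 * h)   # first partial derivative in t1
    e2 = (m(0.0, h) - m(0.0, -h)) / (2.0 * h)   # first partial derivative in t2
    # Mixed second partial derivative gives E(x1 x2).
    e12 = (m(h, h) - m(h, -h) - m(-h, h) + m(-h, -h)) / (4.0 * h * h)
    return e1, e2, e12 - e1 * e2


if __name__ == "__main__":
    # Hypothetical parameters: mu = (1, -2), sigma_1 = 2, sigma_2 = 0.5, rho = 0.3.
    print(moments_from_mgf(1.0, -2.0, 2.0, 0.5, 0.3))
```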
One important result in this connection is that if $X \sim N_p(\tilde{\mu}, \Sigma)$, then any linear function of $X$, say $u = a_1x_1 + \cdots + a_px_p = a'X = X'a$, $a' = (a_1,\ldots,a_p)$, is a univariate normal.


Result 8.12. If $X \sim N_p(\tilde{\mu}, \Sigma)$, then $u = a_1x_1 + \cdots + a_px_p = a'X = X'a$ is univariate normal with the parameters $\mu = E(u) = a'\tilde{\mu}$ and $\sigma^2 = \mathrm{Var}(u) = a'\Sigma a$.


This can be proved by looking at the moment generating function. Since $u$ is a function of $X \sim N_p(\tilde{\mu}, \Sigma)$ we can compute the moment generating function of $u$ from the moment generating function of $X$.

$$M_u(t) = E[e^{tu}] = E[e^{ta'X}] = E[e^{T'X}], \quad T' = ta'$$

$$= e^{T'\tilde{\mu}+\frac{1}{2}T'\Sigma T} = e^{t(a'\tilde{\mu})+\frac{1}{2}t^2(a'\Sigma a)}.$$

But this is the mgf of a univariate normal with parameters $a'\tilde{\mu}$ and $a'\Sigma a$. Hence $u \sim N_1(a'\tilde{\mu}, a'\Sigma a)$. This shows that every linear function of a multivariate normal vector $X$ is univariate normal. One may also note that we have not used the non-singularity of $\Sigma$ in the proof here. Hence the result holds for the singular normal case also.
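The substitution $T = ta$ used in this proof can be verified numerically. The sketch below evaluates the p-variate mgf at $T = ta$ and compares it with the univariate normal mgf with mean $a'\tilde{\mu}$ and variance $a'\Sigma a$. Function names and the numerical values are hypothetical and serve only as an illustration.

```python
import math


def mgf_multivariate_normal(T, mu, sigma):
    """exp(T'mu + T' Sigma T / 2) for vectors and matrices given as nested lists."""
    p = len(T)
    linear = sum(T[i] * mu[i] for i in range(p))
    quad = sum(T[i] * sigma[i][j] * T[j] for i in range(p) for j in range(p))
    return math.exp(linear + 0.5 * quad)


def linear_function_parameters(a, mu, sigma):
    """Mean a'mu and variance a' Sigma a of u = a'X."""
    p = len(a)
    mean = sum(a[i] * mu[i] for i in range(p))
    var = sum(a[i] * sigma[i][j] * a[j] for i in range(p) for j in range(p))
    return mean, var


def mgf_univariate_normal(t, mean, var):
    """mgf of a univariate normal with the given mean and variance."""
    return math.exp(t * mean + 0.5 * t * t * var)


if __name__ == "__main__":
    a = [1.0, -2.0, 0.5]                      # hypothetical coefficient vector
    mu = [0.5, 1.0, -1.0]                     # hypothetical mean vector
    sigma = [[2.0, 0.3, 0.0], [0.3, 1.0, 0.2], [0.0, 0.2, 1.5]]
    mean, var = linear_function_parameters(a, mu, sigma)
    for t in (-0.5, 0.1, 0.7):
        lhs = mgf_multivariate_normal([t * ai for ai in a], mu, sigma)
        print(t, lhs, mgf_univariate_normal(t, mean, var))
```

The same check goes through unchanged when the covariance matrix is singular, in line with the remark that non-singularity is not used in the proof.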

Naturally, one may ask: if $a'X$ is univariate normal, for a given $a$, will $X$ be a multivariate normal? Obviously, this need not hold. But if $u = a'X$ is univariate normal for all $a$, that is, for arbitrary $a$, will $X$ be multivariate normal? The answer is in the affirmative. This, in fact, also provides a definition for a multivariate normal law as the law satisfied by $X$ when $a'X$ is univariate normal for all constant vectors $a$.


Result 8.13. For any vector random variable $X$ and for a constant vector $a$, if $u = a'X = X'a$ is univariate normal for all $a$, then $X$ is multivariate normal, $X \sim N_p(\tilde{\mu}, \Sigma)$, $\Sigma \ge 0$.


The proof is available by retracing the steps in the proof of the previous result. If $u = a'X$ is univariate normal, then its parameters are $E[u] = a'\tilde{\mu}$ and $\mathrm{Var}(u) = a'\Sigma a$. Therefore, the mgf of $u$ is available as

$$M_u(t) = e^{t(a'\tilde{\mu})+\frac{1}{2}t^2(a'\Sigma a)} = e^{T'\tilde{\mu}+\frac{1}{2}T'\Sigma T}, \quad T' = ta' = (ta_1,\ldots,ta_p) \tag{8.26}$$

where $a_1,\ldots,a_p$ are arbitrary, and hence $ta_1,\ldots,ta_p$ are also arbitrary. There are $p$ arbitrary parameters here in (8.26), and hence it is the mgf of the vector $X$. In other words, $X$ is multivariate normal. Note that the proof holds whether $\Sigma$ is non-singular or singular, and hence the result holds for singular as well as non-singular normal cases.


Definition 8.2 (Singular normal distribution). Any vector random variable $X$ having the mgf in (8.23), where $\Sigma$ is singular, is called a singular normal vector and it is denoted as $X \sim N_p(\tilde{\mu}, \Sigma)$, $\Sigma \ge 0$.


As mentioned earlier, there is no density for a singular normal, that is, when $|\Sigma| = 0$, but all properties can be studied by using (8.26).




Since further properties of a multivariate normal distribution involve a lot of ma- 
trix algebra, we will not discuss them here. We will conclude this chapter by looking 
at a matrix-variate normal. 


8.4.1 Matrix-variate normal or Gaussian density 


Consider an $m \times n$ matrix $X$ of $mn$ distinct real scalar random variables and consider the following non-negative function:

$$f(X) = c\,e^{-\frac{1}{2}\mathrm{tr}[A(X-M)B(X-M)']} \tag{8.27}$$

where $X$ and $M$ are $m \times n$, $M$ is a constant matrix, $A$ is $m \times m$ and $B$ is $n \times n$, both constant and positive definite matrices, $X = (x_{ij})$, $M = (m_{ij})$, $-\infty < x_{ij} < \infty$, $-\infty < m_{ij} < \infty$, $i = 1,\ldots,m$; $j = 1,\ldots,n$ and $c$ is the normalizing constant.

We can evaluate the normalizing constant by using the Jacobians of linear transformations that we discussed in Note 8.1. Observe that any positive definite matrix can be represented as $CC'$ for some matrix $C$, where $C$ could be rectangular also. Also a unique square root is defined when a matrix is positive definite. Let $A^{\frac{1}{2}}$ and $B^{\frac{1}{2}}$ denote the unique square roots of $A$ and $B$, respectively. For the following steps to hold, only a representation in the form $A = A_1A_1'$ and $B = B_1B_1'$, with $A_1$ and $B_1$ being non-singular, is sufficient but we will use the symmetric positive definite square roots for convenience. Consider the general linear transformation:

$$Y = A^{\frac{1}{2}}(X-M)B^{\frac{1}{2}} \;\Rightarrow\; dY = |A|^{\frac{n}{2}}|B|^{\frac{m}{2}}\,d(X-M) = |A|^{\frac{n}{2}}|B|^{\frac{m}{2}}\,dX$$


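since $M$ is a constant matrix. Observe that by using the property $\mathrm{tr}(PQ) = \mathrm{tr}(QP)$ when $PQ$ and $QP$ are defined ($PQ$ need not be equal to $QP$), we have

$$\mathrm{tr}[A(X-M)B(X-M)'] = \mathrm{tr}[A^{\frac{1}{2}}(X-M)B^{\frac{1}{2}}B^{\frac{1}{2}}(X-M)'A^{\frac{1}{2}}]$$

$$= \mathrm{tr}(YY') = \sum_{i=1}^{m}\sum_{j=1}^{n} y_{ij}^2, \quad Y = (y_{ij}).$$

But

$$\int_{-\infty}^{\infty} e^{-\frac{1}{2}z^2}\,dz = \sqrt{2\pi} \;\Rightarrow\; \int_Y e^{-\frac{1}{2}\mathrm{tr}(YY')}\,dY = (2\pi)^{\frac{mn}{2}}.$$

Since the total integral is 1, the normalizing constant

$$c = \frac{|A|^{\frac{n}{2}}|B|^{\frac{m}{2}}}{(2\pi)^{\frac{mn}{2}}}.$$

That is,

$$f(X) = \frac{|A|^{\frac{n}{2}}|B|^{\frac{m}{2}}}{(2\pi)^{\frac{mn}{2}}}\,e^{-\frac{1}{2}\mathrm{tr}[A(X-M)B(X-M)']} \tag{8.28}$$

for $A = A' > O$, $B = B' > O$, $X = (x_{ij})$, $M = (m_{ij})$, $-\infty < x_{ij} < \infty$, $-\infty < m_{ij} < \infty$, $i = 1,\ldots,m$; $j = 1,\ldots,n$. The density in (8.28) is known as the matrix-variate Gaussian density. Note that when $m = 1$ we have the usual multivariate density or n-variate Gaussian density in this case.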


Exercises 8.4

8.4.1. If $X \sim N_p(\tilde{\mu}, \Sigma)$, $\Sigma > 0$ and if $X$ is partitioned as $X' = (X_{(1)}', X_{(2)}')$ where $X_{(1)}$ is $r \times 1$, $r < p$ and if $\Sigma$ is partitioned accordingly as

$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}, \quad \Sigma_{11} = \mathrm{Cov}(X_{(1)}) > 0,$$

$\Sigma_{22} = \mathrm{Cov}(X_{(2)}) > 0$, $\Sigma_{12} = \mathrm{Cov}(X_{(1)}, X_{(2)})$, $\Sigma_{12} = \Sigma_{21}'$. Then show that

$$X_{(1)} \sim N_r(\mu_{(1)}, \Sigma_{11}), \quad \Sigma_{11} > 0 \tag{i}$$

$$X_{(2)} \sim N_{p-r}(\mu_{(2)}, \Sigma_{22}), \quad \Sigma_{22} > 0 \tag{ii}$$

where $\tilde{\mu}' = (\mu_{(1)}', \mu_{(2)}')$, $\mu_{(1)}$ is $r \times 1$ and $\mu_{(2)}$ is $(p-r) \times 1$.


8.4.2. In Exercise 8.4.1, evaluate the conditional density of $X_{(1)}$ given $X_{(2)}$, and show that it is also an r-variate Gaussian. Evaluate (1) $E[X_{(1)}|X_{(2)}]$, (2) the covariance matrix of $X_{(1)}$ given $X_{(2)}$.


8.4.3. Answer the questions in Exercise 8.4.2 if $r = 1$, $p - r = p - 1$.


8.4.4. Show that when $m = 1$ the matrix-variate Gaussian becomes n-variate normal. What are the mean value and covariance matrix in this case?


8.4.5. Write the explicit form of a p-variate normal density for $p = 2$. Compute (1) the mean value vector; (2) the covariance matrix; (3) the correlation $\rho$ between the two components and show that $-1 < \rho < 1$ for the Gaussian to be non-singular and that when $\rho = \pm 1$ the Gaussian is singular.


8.5 Matrix-variate gamma density 
The integral representation of a scalar variable gamma function is defined as

$$\Gamma(\alpha) = \int_0^{\infty} x^{\alpha-1}e^{-x}\,dx, \quad \Re(\alpha) > 0.$$

Suppose that we have a $p \times p$ matrix $X = X'$ of $p(p+1)/2$ distinct real scalar variables. Further, let us assume that $X$ is positive definite. Consider the integral of the form:

$$\Gamma_p(\alpha) = \int_{X>0} |X|^{\alpha-\frac{p+1}{2}}e^{-\mathrm{tr}(X)}\,dX, \quad \Re(\alpha) > \frac{p-1}{2}, \tag{8.29}$$

where $|X|$ means the determinant of $X$, $\mathrm{tr}(X)$ is the trace of $X$, $\int_{X>0}$ means the integration over the positive definite matrix $X$ and $dX$ stands for the wedge product of the $p(p+1)/2$ differential elements $dx_{ij}$'s. Observe that (8.29) reduces to the scalar case of the gamma function for $p = 1$. Let us try to evaluate the integral in (8.29). Direct integration over the individual variables in $X$ is not possible. Even for a simple case of $p = 2$, note that the determinant

$$|X| = \begin{vmatrix} x_{11} & x_{12} \\ x_{12} & x_{22} \end{vmatrix} = x_{11}x_{22} - x_{12}^2, \quad \mathrm{tr}(X) = x_{11} + x_{22}$$

and the integration is over the positive definite matrix $X$, which means a triple integral over $x_{11}, x_{22}, x_{12}$ subject to the conditions $x_{11} > 0$, $x_{22} > 0$, $x_{11}x_{22} - x_{12}^2 > 0$. [Observe that due to symmetry $x_{21} = x_{12}$.] Evaluation of this triple integral, that is, the evaluation of (8.29) for $p = 2$, is left as an exercise to the students.

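The integral in (8.29) can be evaluated by using Result 8.10 in Note 8.1 for the non-linear transformation $X = TT'$. Let us make the transformation $X = TT'$ where $T = (t_{ij})$ is a lower triangular matrix with positive diagonal elements, that is, $t_{jj} > 0$, $j = 1,\ldots,p$, $t_{ij} = 0$ for all $i < j$. Under this transformation,

$$|X|^{\alpha-\frac{p+1}{2}}\,dX = 2^p\Big\{\prod_{j=1}^{p} t_{jj}^2\Big\}^{\alpha-\frac{p+1}{2}}\Big\{\prod_{j=1}^{p} t_{jj}^{p+1-j}\Big\}\,dT = 2^p\Big\{\prod_{j=1}^{p}(t_{jj}^2)^{\alpha-\frac{j}{2}}\Big\}\,dT$$

and, since $\mathrm{tr}(X) = \mathrm{tr}(TT') = \sum_{i\ge j} t_{ij}^2$,

$$\Gamma_p(\alpha) = \int_T \Big\{\prod_{j=1}^{p} 2(t_{jj}^2)^{\alpha-\frac{j}{2}}e^{-t_{jj}^2}\Big\}\Big\{\prod_{i>j} e^{-t_{ij}^2}\Big\}\,dT.$$

We need to evaluate only two types of integrals here, one type on $t_{jj}$ and the other type on $t_{ij}$, $i > j$. That is,

$$2\int_0^{\infty}(t_{jj}^2)^{\alpha-\frac{j}{2}}e^{-t_{jj}^2}\,dt_{jj} = \int_0^{\infty} u^{\alpha-\frac{j-1}{2}-1}e^{-u}\,du$$

under the substitution $u = t_{jj}^2$,

$$= \Gamma\Big(\alpha - \frac{j-1}{2}\Big), \quad \Re(\alpha) > \frac{j-1}{2},$$

and then the integral exists for $j = 1,\ldots,p$, and hence the final condition will be $\Re(\alpha) > \frac{p-1}{2}$ and the gamma product is then

$$\Gamma(\alpha)\,\Gamma\Big(\alpha-\frac{1}{2}\Big)\cdots\Gamma\Big(\alpha-\frac{p-1}{2}\Big).$$

In $\prod_{i>j}$ there are $\frac{p(p-1)}{2}$ factors and each factor is the integral

$$\int_{-\infty}^{\infty} e^{-t_{ij}^2}\,dt_{ij} = \sqrt{\pi}$$

and thus this product gives $(\sqrt{\pi})^{\frac{p(p-1)}{2}} = \pi^{\frac{p(p-1)}{4}}$. Therefore, the integral reduces to the following:

$$\Gamma_p(\alpha) = \pi^{\frac{p(p-1)}{4}}\,\Gamma(\alpha)\,\Gamma\Big(\alpha-\frac{1}{2}\Big)\cdots\Gamma\Big(\alpha-\frac{p-1}{2}\Big) \tag{8.30}$$

for $\Re(\alpha) > \frac{p-1}{2}$.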


Notation 8.3. $\Gamma_p(\alpha)$: Real matrix-variate gamma function.


Definition 8.3 (The real matrix-variate gamma). The following are the definition of $\Gamma_p(\alpha)$ and its integral representation:

$$\Gamma_p(\alpha) = \pi^{\frac{p(p-1)}{4}}\,\Gamma(\alpha)\,\Gamma\Big(\alpha-\frac{1}{2}\Big)\cdots\Gamma\Big(\alpha-\frac{p-1}{2}\Big), \quad \Re(\alpha) > \frac{p-1}{2}$$

$$= \int_{X>0} |X|^{\alpha-\frac{p+1}{2}}e^{-\mathrm{tr}(X)}\,dX, \quad \Re(\alpha) > \frac{p-1}{2}.$$

We can define a matrix-variate gamma, corresponding to $\Gamma_p(\alpha)$, when the elements in the $p \times p$ matrix $X$ are in the complex domain. This will be called the complex matrix-variate gamma as opposed to the real matrix-variate gamma. This will not be discussed here. Students who are interested in or curious to know about random variables on the complex domain may see the book [3].


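For example, for $p = 2$, we can obtain the integral from the formula (8.30). That is,

$$\Gamma_2(\alpha) = \pi^{\frac{1}{2}}\,\Gamma(\alpha)\,\Gamma\Big(\alpha-\frac{1}{2}\Big), \quad \Re(\alpha) > \frac{1}{2}.$$

For $p = 3$,

$$\Gamma_3(\alpha) = \pi^{\frac{3}{2}}\,\Gamma(\alpha)\,\Gamma\Big(\alpha-\frac{1}{2}\Big)\,\Gamma(\alpha-1), \quad \Re(\alpha) > 1.$$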
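Formula (8.30) is straightforward to evaluate for real $\alpha$. The following minimal sketch computes the real matrix-variate gamma function with the standard library; the function name `real_matrix_gamma` is an assumption introduced for illustration, and only real arguments are handled.

```python
import math


def real_matrix_gamma(p, alpha):
    """Gamma_p(alpha) from (8.30) for real alpha > (p - 1) / 2."""
    if alpha <= (p - 1) / 2:
        raise ValueError("need alpha > (p - 1) / 2 for convergence")
    value = math.pi ** (p * (p - 1) / 4)
    for j in range(1, p + 1):
        # j-th factor of the gamma product: Gamma(alpha - (j - 1) / 2).
        value *= math.gamma(alpha - (j - 1) / 2)
    return value


if __name__ == "__main__":
    for p in (1, 2, 3):
        print(p, real_matrix_gamma(p, 2.0))
```

For $p = 1$ the function reduces to the ordinary gamma function, which gives a quick consistency check.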


By using the integral representation of a real matrix-variate gamma, one can define a real matrix-variate gamma density. Let us consider the following function:

$$f(X) = c\,|X|^{\alpha-\frac{p+1}{2}}e^{-\mathrm{tr}(BX)}$$

for $X = X' > O$, $B = B' > O$ where the matrices are $p \times p$ positive definite and $B$ is a constant matrix and $X$ is a matrix of $p(p+1)/2$ distinct real scalar variables. If $f(X)$ is a density, then let us evaluate the normalizing constant $c$. Write $\mathrm{tr}(BX) = \mathrm{tr}(B^{\frac{1}{2}}XB^{\frac{1}{2}})$ by using the property that for two matrices $A$ and $B$, $\mathrm{tr}(AB) = \mathrm{tr}(BA)$ as long as $AB$ and $BA$ are defined. Make the transformation

$$Y = B^{\frac{1}{2}}XB^{\frac{1}{2}} \;\Rightarrow\; dX = |B|^{-\frac{p+1}{2}}\,dY$$

by using Result 8.9. Also note that

$$|X|^{\alpha-\frac{p+1}{2}}\,dX = |B|^{-\alpha}|Y|^{\alpha-\frac{p+1}{2}}\,dY.$$

Now integrating out, we have

$$\int_{X>0} |X|^{\alpha-\frac{p+1}{2}}e^{-\mathrm{tr}(BX)}\,dX = |B|^{-\alpha}\int_{Y>0} |Y|^{\alpha-\frac{p+1}{2}}e^{-\mathrm{tr}(Y)}\,dY = |B|^{-\alpha}\,\Gamma_p(\alpha).$$

Hence the normalizing constant is $|B|^{\alpha}/\Gamma_p(\alpha)$ and, therefore, the density is given by

$$f(X) = \frac{|B|^{\alpha}}{\Gamma_p(\alpha)}\,|X|^{\alpha-\frac{p+1}{2}}e^{-\mathrm{tr}(BX)} \tag{8.31}$$

for $X = X' > O$, $B = B' > O$, $\Re(\alpha) > \frac{p-1}{2}$ and zero otherwise. This density is known as the real matrix-variate gamma density.

A particular case of this density for $B = \frac{1}{2}\Sigma^{-1}$, $\Sigma > 0$ and $\alpha = \frac{n}{2}$, $n = p, p+1, \ldots$ is the most important density in multivariate statistical analysis, known as the Wishart density. By partitioning the matrix $X$, it is not difficult to show that all the leading sub-matrices in $X$ also have matrix-variate gamma densities when $X$ has a matrix-variate gamma density. This density enjoys many properties, parallel to the properties enjoyed by the gamma density in the scalar case. Also there are other matrix-variate densities such as matrix-variate type 1 and type 2 beta densities and matrix-variate versions of almost all other densities in current use. We will not elaborate on these here. This density is introduced here for the sake of information. Those who are interested in reading more on the matrix-variate densities in the real as well as in the complex cases may see the above mentioned book on Jacobians.


Exercises 8.5 


8.5.1. Evaluate the integral i j e "dX and write down the conditions needed for 
the convergence of the integral, where the matrix is p x p. 


8.5.2. Starting with the integral representation of $\Gamma_p(\alpha)$ and then taking the product $\Gamma_p(\alpha)\Gamma_p(\beta)$ and treating it as a double integral, show that

$$\int_{O<X<I} |X|^{\alpha-\frac{p+1}{2}}\,|I-X|^{\beta-\frac{p+1}{2}}\,dX = \frac{\Gamma_p(\alpha)\Gamma_p(\beta)}{\Gamma_p(\alpha+\beta)}$$

and write down the existence conditions of the integral, where $O < X < I$ means that the $p \times p$ matrix $X$ is positive definite and $I - X$ is also positive definite where $I$ is the identity matrix. [Observe that definiteness is defined only for symmetric matrices when real and Hermitian matrices when in the complex domain.]


8.5.3. Evaluate the integral $\int_{O<X<I} dX$ by using Exercise 8.5.2 or otherwise, where $X$ is $p \times p$ real and verify your result for (1) $p = 1$; (2) $p = 2$; (3) $p = 3$ by direct integration as multiple integrals.


8.5.4. Evaluate the integral $\int_{O<X<I} |X|\,dX$ where $X$ is $p \times p$, and verify your result for (1) $p = 1$; (2) $p = 2$ by direct integration as multiple integrals.


8.5.5. Evaluate the integral $\int_{O<X<I} |I - X|\,dX$ where $X$ is $p \times p$, and verify the result by direct integration for (1) $p = 1$; (2) $p = 2$.


9 Collection of random variables 


9.1 Introduction 


We had come across one collection of random variables called a simple random sam- 
ple, where the variables were independently and identically distributed, iid variables. 
First, we look at some properties of scalar variables, which will be used in the discus- 
sions to follow. Hence these will be listed here as results. 


Result 9.1. For a real scalar variable $x$, let $E(x) = \mu$, $\mathrm{Var}(x) = \sigma^2 < \infty$. Then

$$\Pr\{|x - \mu| \ge k\sigma\} \le \frac{1}{k^2}, \quad k > 0. \tag{9.1}$$

This result says that if we are $k\sigma$ away from the mean value $\mu$ then the total probability in the tails is less than or equal to $\frac{1}{k^2}$. This result is also known as Chebyshev's inequality. The variables could be discrete or continuous. The probability in the tail is marked in Figure 9.1.


[Figure 9.1: Probability in the tail: Chebyshev's inequality. The horizontal axis marks $\mu - k\sigma$, $\mu$ and $\mu + k\sigma$.]


Then the probability in the middle portion is available from (9.1) as one minus the probability in the tails. That is,

$$\Pr\{|x - \mu| < k\sigma\} \ge 1 - \frac{1}{k^2}. \tag{9.2}$$

If we replace $k\sigma$ by $k$, then (9.1) and (9.2) can be written in different forms:

$$\Pr\{|x - \mu| \ge k\sigma\} \le \frac{1}{k^2}$$

$$\Pr\{|x - \mu| \ge k\} \le \frac{\sigma^2}{k^2}$$

$$\Pr\{|x - \mu| < k\sigma\} \ge 1 - \frac{1}{k^2}$$

$$\Pr\{|x - \mu| < k\} \ge 1 - \frac{\sigma^2}{k^2} \tag{9.3}$$

All these forms in (9.3) are called Chebyshev's inequalities. The proof is quite simple. We will illustrate the proof for the continuous case. For the discrete case, it is parallel. Consider the variance $\sigma^2$.


Open Access. © 2018 Arak M. Mathai, Hans J. Haubold, published by De Gruyter. This work is licensed
under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
https://doi.org/10.1515/9783110562545-009


$$\sigma^2 = \int_{-\infty}^{\infty}(x-\mu)^2f(x)\,dx$$

$$= \int_{-\infty}^{\mu-k\sigma}(x-\mu)^2f(x)\,dx + \int_{\mu+k\sigma}^{\infty}(x-\mu)^2f(x)\,dx + \int_{\mu-k\sigma}^{\mu+k\sigma}(x-\mu)^2f(x)\,dx.$$

Let us delete the middle portion. Then the left side must be bigger than or equal to the balance on the right. That is,

$$\sigma^2 \ge \int_{-\infty}^{\mu-k\sigma}(x-\mu)^2f(x)\,dx + \int_{\mu+k\sigma}^{\infty}(x-\mu)^2f(x)\,dx.$$

In these integrals in the tails $|x - \mu| \ge k\sigma$, and hence if we replace $|x - \mu|$ by its lowest possible value in these two intervals, namely $k\sigma$, then the inequality must remain in the same direction. That is,

$$\sigma^2 \ge \int_{-\infty}^{\mu-k\sigma}(k\sigma)^2f(x)\,dx + \int_{\mu+k\sigma}^{\infty}(k\sigma)^2f(x)\,dx$$

$$= (k^2\sigma^2)\Big[\int_{-\infty}^{\mu-k\sigma}f(x)\,dx + \int_{\mu+k\sigma}^{\infty}f(x)\,dx\Big] = (k^2\sigma^2)\Pr\{|x - \mu| \ge k\sigma\}.$$

Dividing both sides by $k^2\sigma^2$ we have the inequality

$$\Pr\{|x - \mu| \ge k\sigma\} \le \frac{1}{k^2}$$

which holds for all non-degenerate random variables with finite variance. Since we divided by $\sigma^2$ the step is valid only if the variable is non-degenerate. From this result, the above four results in (9.3) are now available.


Result 9.2 (Chebyshev's inequality). For a real non-degenerate scalar random variable $x$ for which the variance $\sigma^2$ is finite, the inequalities in (9.3) hold.
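Chebyshev's bound is usually far from tight, which a small simulation makes visible. The sketch below draws observations from an exponential population with mean 1 (so that $\mu = \sigma = 1$) and compares the observed tail fractions with $1/k^2$. The choice of population, the sample size, the seed and the function names are all assumptions made for this illustration.

```python
import random


def tail_fraction(samples, mu, sigma, k):
    """Fraction of observations with |x - mu| >= k * sigma."""
    hits = sum(1 for x in samples if abs(x - mu) >= k * sigma)
    return hits / len(samples)


def chebyshev_table(n=100_000, seed=0, ks=(1.5, 2.0, 3.0)):
    """Rows (k, observed tail fraction, Chebyshev bound 1 / k^2)."""
    rng = random.Random(seed)
    # Exponential population with rate 1: mean 1 and standard deviation 1.
    samples = [rng.expovariate(1.0) for _ in range(n)]
    return [(k, tail_fraction(samples, 1.0, 1.0, k), 1.0 / (k * k)) for k in ks]


if __name__ == "__main__":
    for k, observed, bound in chebyshev_table():
        print(f"k={k}: observed tail {observed:.4f} <= bound {bound:.4f}")
```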


From the procedure above, it is clear that similar results will hold true if we take any other distance measure. Consider the distance

$$M_r = [E(|x - \mu|^r)]^{\frac{1}{r}} \;\Rightarrow\; M_r^r = E(|x - \mu|^r)$$

and let us look at the tail areas after $k$ times $M_r$ from the mean value, that is, $\Pr\{|x - \mu| \ge kM_r\}$. Then proceeding as above, we have

$$\Pr\{|x - \mu| \ge kM_r\} \le \frac{1}{k^r}, \quad r = 1, 2, \ldots$$

$$\Pr\{|x - \mu| \ge k\} \le \frac{M_r^r}{k^r}, \quad r = 1, 2, 3, \ldots \tag{9.4}$$

Hence we have the following result.


Result 9.3. For non-degenerate real scalar random variables for which the r-th absolute moment about the mean value exists, the inequality in (9.4) and the corresponding four inequalities, parallel to the ones in (9.3), hold.


In the above procedure, we are deleting the middle portion of the probabilities 
and then replacing the distance measure by the lowest point in the interval. Then if 
the lowest point always remains positive, we can extend the idea to positive random 
variables and obtain an inequality in terms of the mean value. 

Let $x$ be a real scalar positive random variable with mean value denoted by $\mu$ so that $\mu > 0$. Let $a$ be an arbitrary positive number. Then

$$\mu = \int_0^{\infty} xf(x)\,dx = \int_0^{a} xf(x)\,dx + \int_a^{\infty} xf(x)\,dx.$$

Here, all quantities involved are non-negative. Hence if we delete the integral $\int_0^a xf(x)\,dx$ then the balance should be less than or equal to $\mu$. If the deleted area is zero, then it will be equal. Then

$$\mu \ge \int_a^{\infty} xf(x)\,dx.$$

But in the interval $[a, \infty)$ the value of $x$ is bigger than or equal to $a$. Hence if we replace $x$ by $a$ then we should get a quantity still less than or equal to the previous quantity. Therefore,

$$\mu \ge \int_a^{\infty} af(x)\,dx = a\int_a^{\infty} f(x)\,dx = a\Pr\{x \ge a\} \;\Rightarrow\; \Pr\{x \ge a\} \le \frac{\mu}{a}.$$

Result 9.4. For non-degenerate positive real scalar random variables for which the mean value $\mu$ exists, for any positive number $a$,

$$\Pr\{x \ge a\} \le \frac{\mu}{a}. \tag{9.5}$$
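To see how the bound in (9.5) compares with an exact tail probability, one can take a positive variable whose tail is known in closed form. The sketch below uses an exponential variable with mean $\mu$, for which $\Pr\{x \ge a\} = e^{-a/\mu}$; the choice of the exponential population and the function names are assumptions for illustration only.

```python
import math


def markov_bound(mu, a):
    """Upper bound mu / a on Pr{x >= a} from (9.5)."""
    return mu / a


def exponential_tail(mu, a):
    """Exact Pr{x >= a} for an exponential variable with mean mu."""
    return math.exp(-a / mu)


def compare(mu, a_values):
    """Rows (a, exact tail, Markov bound) for an exponential population."""
    return [(a, exponential_tail(mu, a), markov_bound(mu, a)) for a in a_values]


if __name__ == "__main__":
    for a, exact, bound in compare(2.0, (0.5, 1.0, 2.0, 4.0, 8.0)):
        print(f"a={a}: exact {exact:.4f}, bound {bound:.4f}")
```

For $a \le \mu$ the bound exceeds 1 and is therefore uninformative; it becomes useful only for $a$ well above the mean value.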


The inequality in (9.5) is often known as Markov's inequality. If we have iid variables with finite common variance $\sigma^2$ and mean value $\mu$, then we have shown in (7.16) of Chapter 7 that

$$\mathrm{Var}(\bar{x}) = \frac{\sigma^2}{n}, \quad \bar{x} = \frac{x_1 + \cdots + x_n}{n}, \quad E(\bar{x}) = \mu.$$

Then from Chebyshev's inequality in (9.3), it follows that

$$\Pr\{|\bar{x} - \mu| < k\} \ge 1 - \frac{\mathrm{Var}(\bar{x})}{k^2} = 1 - \frac{\sigma^2}{nk^2} \to 1 \quad \text{when } n \to \infty. \tag{9.6}$$

Thus the probability that $\bar{x}$ lies within any fixed distance $k > 0$ of $\mu$ goes to 1 as $n \to \infty$. This is known as the weak law of large numbers.


9.2 Laws of large numbers 


Result 9.5 (The weak law of large numbers). If $x_1,\ldots,x_n$ are iid variables with a common finite variance $\sigma^2$ and common mean value $\mu$, and if $\bar{x} = \frac{x_1+\cdots+x_n}{n}$ is the sample mean, then the sample mean converges to the population mean value $\mu$ in probability, that is,

$$\lim_{n\to\infty}\Pr\{|\bar{x} - \mu| < k\} = 1$$

for every $k > 0$.


This is the limit of a probability. This can also be looked upon as the stochastic convergence of $\bar{x}$ to its mean value $\mu$ or convergence in the sense of probability. The phrase "weak law" suggests that there is a strong law, which will be a statement on the probability of a limit, which will not be discussed here because it needs more analysis for its proof. A simple illustration of the weak law and its consequence can be seen from Bernoulli trials.

Consider $n$ independent Bernoulli trials. In each trial, the result is either 1 (success) or 0 (failure). If $x_i$ denotes the outcome in the i-th trial, then $x_1,\ldots,x_n$ are $n$ iid variables taking values 1 with probability $p$ and 0 with probability $q = 1 - p$, or it is a simple random sample of size $n$ from a Bernoulli population. Then the sum $x_1 + \cdots + x_n$ will give $x$ = the number of successes in $n$ Bernoulli trials, and this $x$ is the binomial random variable. Then $\bar{x} = \frac{x}{n}$ is the proportion of successes in $n$ independent Bernoulli trials. What the weak law of large numbers says is that this proportion of successes converges to the true probability of success $p$ when the number of trials $n$ goes to infinity. This is a very important observation. If we conduct Bernoulli trials, such as getting a head when a coin is tossed repeatedly under the same conditions, and if 40 successes are observed in 100 trials then $\bar{x} = 0.4$ can be taken as an estimate of the true probability. When the number of trials becomes larger and larger, we get a better and better estimate of the true probability of getting a head, and in the limit as $n$ goes to infinity the sample proportion converges to the true probability $p$. This is the basis for taking relative frequencies as estimates for the true probability of success in Bernoulli trials.
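The behaviour of the proportion of successes can be watched in a short simulation. The sketch below generates Bernoulli trials with a hypothetical success probability $p = 0.4$ and reports the sample proportion for increasing numbers of trials; the seed, the sample sizes and the function names are illustrative assumptions.

```python
import random


def success_proportion(n, p, rng):
    """Proportion of successes in n independent Bernoulli(p) trials."""
    successes = sum(1 for _ in range(n) if rng.random() < p)
    return successes / n


def proportions_for_growing_n(p=0.4, sizes=(100, 1_000, 10_000, 100_000), seed=0):
    """Rows (n, sample proportion) for increasing numbers of trials."""
    rng = random.Random(seed)
    return [(n, success_proportion(n, p, rng)) for n in sizes]


if __name__ == "__main__":
    for n, proportion in proportions_for_growing_n():
        print(f"n={n}: proportion of successes {proportion:.4f}")
```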




9.3 Central limit theorems 


Another interesting result connected with a collection of random variables is a limiting property known as the central limit theorem. We will illustrate it for iid variables. Let $x_1,\ldots,x_n$ be a simple random sample of size $n$ from some population with finite variance $\sigma^2$. Let the sample mean be denoted by $\bar{x} = \frac{x_1+\cdots+x_n}{n}$. Then $E(\bar{x}) = \mu$, the mean value in the population, and $\mathrm{Var}(\bar{x}) = \frac{\sigma^2}{n}$. Let us look at the standardized sample mean, denoted by $z$, observing that for any random variable $u$, $v = \frac{u - E(u)}{\sqrt{\mathrm{Var}(u)}}$ is the standardized $u$ so that $E(v) = 0$ and $\mathrm{Var}(v) = 1$.

$$z = \frac{\bar{x} - E(\bar{x})}{\sqrt{\mathrm{Var}(\bar{x})}} = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{\sqrt{n}(\bar{x} - \mu)}{\sigma}.$$

One central limiting property says that the standardized sample mean, whatever be the population with finite variance, continuous or discrete, will go to standard normal or Gaussian when the sample size $n$ goes to infinity. There are various versions of this limiting property depending upon the conditions that we impose. We will state a central limiting property under the existence of the second moment and then prove it by assuming that the mgf exists.


Result 9.6 (The central limit theorem). Consider a simple random sample of size $n$ from a population with finite variance $\sigma^2 < \infty$. Let $\bar{x} = \frac{x_1+\cdots+x_n}{n}$. Then the standardized sample mean goes to standard normal when $n \to \infty$.


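Proof. Let $z$ be the standardized sample mean

$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{\sqrt{n}}{\sigma}\Big[\frac{\sum_{j=1}^{n}(x_j - \mu)}{n}\Big] = \frac{1}{\sqrt{n}}\sum_{j=1}^{n}\Big(\frac{x_j - \mu}{\sigma}\Big)$$

where $\mu$ is the population mean value and $\sigma^2$ is the population variance, which is assumed to be finite. Let $M_{\frac{x_j-\mu}{\sigma}}(t)$ be the mgf of $\frac{x_j-\mu}{\sigma}$. Since the $x_j$'s are independently and identically distributed, we have

$$M_z(t) = \Big[M_{\frac{x_1-\mu}{\sigma}}\Big(\frac{t}{\sqrt{n}}\Big)\Big]^n.$$

Taking logarithms and expanding we have

$$\ln M_z(t) = n\ln M_{\frac{x_1-\mu}{\sigma}}\Big(\frac{t}{\sqrt{n}}\Big)$$

$$= n\ln\Big[1 + \frac{t}{\sqrt{n}}\frac{E(x_1-\mu)}{\sigma\,1!} + \frac{t^2}{n}\frac{E(x_1-\mu)^2}{\sigma^2\,2!} + \cdots\Big]$$

$$= n\ln\Big[1 + \frac{t^2}{2n} + \epsilon\Big], \quad \epsilon = O\Big(\frac{1}{n^{3/2}}\Big),$$

which goes to $\frac{t^2}{2}$ when $n \to \infty$. Then $M_z(t) \to e^{\frac{t^2}{2}}$, which is the mgf of a standard normal variable. Hence when $n \to \infty$ the standardized sample mean goes to a standard normal variable.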
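The limiting behaviour can be examined empirically. The sketch below standardizes the mean of $n$ uniform $[0,1]$ observations (for which $\mu = 1/2$ and $\sigma^2 = 1/12$) and compares the observed fraction of $z$ values not exceeding a point $c$ with the standard normal distribution function. The uniform population, the sample size $n = 30$, the number of replications and the function names are assumptions made for the illustration.

```python
import math
import random
from statistics import NormalDist


def standardized_mean_uniform(n, rng):
    """z = sqrt(n) (xbar - mu) / sigma for n uniform [0, 1] observations."""
    xbar = sum(rng.random() for _ in range(n)) / n
    mu, sigma = 0.5, math.sqrt(1.0 / 12.0)
    return math.sqrt(n) * (xbar - mu) / sigma


def empirical_cdf_at(c, n=30, replications=20_000, seed=0):
    """Observed fraction of standardized sample means that are <= c."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(replications)
               if standardized_mean_uniform(n, rng) <= c)
    return hits / replications


if __name__ == "__main__":
    for c in (-1.0, 0.0, 1.0, 2.0):
        print(c, empirical_cdf_at(c), NormalDist().cdf(c))
```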


Let us examine the consequences of this result. 

(1) If the population is normal, then the standardized sample mean is exactly a 
standard normal variable for all or for every n. 

(2) When we have a Bernoulli population and if $x_1,\ldots,x_n$ are iid Bernoulli variables then the sample sum $x = x_1 + \cdots + x_n$ is the binomial variable because the sample sum gives the number of successes in $n$ independent Bernoulli trials. Then the sample mean $\bar{x} = \frac{x}{n}$ is the binomial proportion with expected value and variance given by

$$E(\bar{x}) = p, \quad \mathrm{Var}(\bar{x}) = \frac{pq}{n}.$$

This means that the standardized sample mean

$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{\frac{x}{n} - p}{\sqrt{pq/n}} = \frac{x - np}{\sqrt{npq}},$$

which is nothing but the standardized binomial variable itself, $q = 1 - p$. Hence for the binomial variable its standardized form will go to a standard normal when the number of trials $n$ goes to infinity. This is due to the fact that the binomial proportion is nothing but the sample mean when the sample comes from a Bernoulli population.

(3) When the population is gamma with shape parameter $\alpha$ and scale parameter $\beta$, we know that the population mean value is $\alpha\beta$ and the population variance is $\alpha\beta^2$. Hence

$$z = \frac{\bar{x} - \alpha\beta}{\beta\sqrt{\alpha}/\sqrt{n}} = \frac{\sqrt{n}(\bar{x} - \alpha\beta)}{\beta\sqrt{\alpha}} \to N(0,1)$$

as $n \to \infty$. We had already seen that when the population is gamma then for every $n$, $z$ is a relocated and re-scaled gamma variable and this gamma variable goes to a standard normal, which is an interesting result.

The importance of this limit theorem is that whatever be the population, whether discrete or continuous, the standardized sample mean will go to a standard normal when $n \to \infty$ and when the population variance is finite. Thus the normal or Gaussian distribution becomes a very important distribution in statistical analysis. Misuses come from interpreting this limit theorem in terms of $\bar{x} - \mu$ or $\bar{x}$. This limit theorem does not imply that $\bar{x} - \mu \sim N(0, \frac{\sigma^2}{n})$ for large $n$. It does not imply that $\bar{x} \sim N(\mu, \frac{\sigma^2}{n})$ for large $n$.

Before concluding this section, some more technical terms will be introduced but 
a detailed discussion of these will be done in later chapters. 


Definition 9.1 (A statistic). Let $x_1,\ldots,x_n$ be iid variables or a simple random sample of size $n$ from some population. Any observable function of $x_1,\ldots,x_n$ is called a statistic. Several such functions are called statistics, different from the subject of statistics.


For example, $\bar{x} = \frac{x_1+\cdots+x_n}{n}$ is a statistic. $T_1 = \sum_{j=1}^{n} x_j^2$ is another statistic. $\bar{x}$ and $T_1$ are two statistics. If the function contains some unknown parameters, such as $\sum_{j=1}^{n}(x_j - \mu)^2$, it is not a statistic because $\mu$ here is not known. But if $\mu$ is known, such as $\mu = 2$, then $\sum_{j=1}^{n}(x_j - 2)^2$ is a statistic. One important statistic is the sample mean. Another important statistic is the sample variance.


Definition 9.2 (Sample variance). Consider $x_1,\ldots,x_n$ iid variables. Then

$$S^2 = \frac{\sum_{j=1}^{n}(x_j - \bar{x})^2}{n}$$

is called the sample variance.


Definition 9.3 (Sampling distributions). The distribution of a statistic is known as 
a sampling distribution. 


If we consider the distributions of $\bar{x}$ and $S^2$, then these are two sampling distributions. Since when $n = 1$ the original population is described, the population distribution itself can be looked upon as a sampling distribution also. The most important sampling distributions in statistical literature are the chi-square distribution, Student-t distribution and the F-distribution. Out of these, chi-square was discussed as a special case of a gamma distribution but it is also associated with a sampling distribution. Discussion of sampling distributions will be postponed to Chapter 10. Before concluding this section, a small property will be examined. When we have a simple random sample from a population with mean value $\mu$ and variance $\sigma^2$ then we have seen that

$$E[\bar{x}] = \mu \quad \text{and} \quad \mathrm{Var}(\bar{x}) = \frac{\sigma^2}{n} = \frac{\text{Population variance}}{\text{Sample size}}.$$


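What is the expected value of the sample variance? This can be computed by using a standard result. Note that $E[x_j - \mu]^2 = \mathrm{Var}(x_j) = \sigma^2$ for $j = 1,\ldots,n$, and hence $E[\sum_{j=1}^{n}(x_j - \mu)^2] = n\sigma^2$. Consider

$$\sum_{j=1}^{n}(x_j - \mu)^2 = \sum_{j=1}^{n}[(x_j - \bar{x}) + (\bar{x} - \mu)]^2$$

$$= \sum_{j=1}^{n}(x_j - \bar{x})^2 + n(\bar{x} - \mu)^2 + 2(\bar{x} - \mu)\sum_{j=1}^{n}(x_j - \bar{x})$$

$$= \sum_{j=1}^{n}(x_j - \bar{x})^2 + n(\bar{x} - \mu)^2 \quad \text{since } \sum_{j=1}^{n}(x_j - \bar{x}) = 0.$$

Now taking expectations on both sides, we have

$$n\sigma^2 = E\Big[\sum_{j=1}^{n}(x_j - \bar{x})^2\Big] + n\,\mathrm{Var}(\bar{x}) = E\Big[\sum_{j=1}^{n}(x_j - \bar{x})^2\Big] + \sigma^2 \;\Rightarrow\; E\Big[\sum_{j=1}^{n}(x_j - \bar{x})^2\Big] = (n-1)\sigma^2.$$

In other words, if

$$S_1^2 = \frac{\sum_{j=1}^{n}(x_j - \bar{x})^2}{n-1}$$

then $E[S_1^2] = \sigma^2$ and this property is called unbiasedness.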


Definition 9.4 (Unbiasedness). If $T$ is a statistic and if $E[T] = \theta$ for all admissible values of $\theta$, then $T$ is called unbiased for $\theta$ or $T$ is an unbiased estimator of $\theta$.


This is a desirable property in many cases. We have seen that $S_1^2$ is unbiased for $\sigma^2$ but $S^2$ is not unbiased for the population variance $\sigma^2$. Because of this property some people define sample variance as $S_1^2$ instead of $S^2$. But $S_1^2$ should not be taken as sample variance because it is not consistent with the original definition of variance as $\mathrm{Var}(u) = E[u - E(u)]^2$. For example, take a discrete random variable $x$ taking values $x_1,\ldots,x_n$ with probabilities $\frac{1}{n}$ each. Then $E[x] = \bar{x}$ and $\mathrm{Var}(x) = S^2$. Further, $S^2$ is the square of per unit distance or dispersion of $x$ from the point of location $\bar{x}$ and consistent with the idea of dispersion or scatter. Thus the proper measure to take for sample variance is $S^2$ and not $S_1^2$. Besides, unbiasedness is not a desirable property in many situations.
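The difference between the two divisors can be seen in a simulation. The sketch below repeatedly draws samples of size $n$ from a normal population with variance $\sigma^2$ and averages the two statistics with divisors $n$ and $n - 1$; the averages should be near $\frac{n-1}{n}\sigma^2$ and $\sigma^2$, respectively. The normal population, the parameter values and the function name are assumptions chosen for illustration.

```python
import random


def average_variance_estimates(n=5, sigma=2.0, replications=40_000, seed=0):
    """Averages over many samples of S^2 (divisor n) and S_1^2 (divisor n - 1)."""
    rng = random.Random(seed)
    total_s2 = 0.0
    total_s1_sq = 0.0
    for _ in range(replications):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        xbar = sum(xs) / n
        ss = sum((x - xbar) ** 2 for x in xs)  # sum of squared deviations from xbar
        total_s2 += ss / n
        total_s1_sq += ss / (n - 1)
    return total_s2 / replications, total_s1_sq / replications


if __name__ == "__main__":
    # With sigma^2 = 4 and n = 5 the targets are 3.2 and 4.0.
    print(average_variance_estimates())
```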


Some general properties on independence 


Some general properties on product probability property or statistical independence 
will be mentioned here. 

(a) If the real scalar random variables $x$ and $y$ are independently distributed, then $u = ax + b$, $a \ne 0$ and $v = cy + d$, $c \ne 0$ are also independently distributed.

(b) If the real scalar random variables $x$ and $y$ are independently distributed, then (i) $x^2$ and $y$; (ii) $x$ and $y^2$; (iii) $x^2$ and $y^2$ are independently distributed. Note that when $x$ and $y$ are independently distributed, that is a property holding in all four quadrants. But (iii) is a property holding in the first quadrant only, (ii) is a property holding in the first and second quadrants only and (i) is a property holding in the first and fourth quadrants only. A property holding in a few quadrants need not hold in all quadrants unless the variables are restricted such as positive variables. If $x$ and $y$ are positive random variables then all properties in (i), (ii), (iii) will hold, otherwise (i) or (ii) or (iii) need not necessarily imply that $x$ and $y$ are independently distributed.


Exercises 9.3 


9.3.1. Use a computer and select random numbers between 0 and 1. This is equivalent to taking independent observations from a uniform population over [0,1]. For each sample, starting from the number of points $n = 5$, calculate the standardized sample mean $z = \frac{\sqrt{n}(\bar{x} - \mu)}{\sigma}$, remembering that for a uniform random variable over [0,1], $\mu = \frac{1}{2}$, $\sigma^2 = \frac{1}{12}$. Make many samples of size 5, form the frequency table of $z$ values and smooth to get the approximate curve. Repeat this for samples of sizes $n = 5, 6, \ldots$ and estimate $n$ so that the simulated curve approximates well with a standard normal curve.


9.3.2. Repeat Exercise 9.3.1 if the population is exponential with mean value $\mu = 5$. [Select a random number from the interval [0, 1]. Convert that into an observation from the exponential population by the probability integral transformation of Section 6.8 in Chapter 6, and then proceed.]


9.3.3. Consider the standardized sample mean when the sample comes from a gamma population with the scale parameter $\beta = 1$ and shape parameter $\alpha = 5$. Show that the standardized sample mean is a relocated and re-scaled gamma variable.


9.3.4. By using a computer or with the help of MAPLE or MATHEMATICA, compute the upper 5% tail as a function of $n$, the sample size. Determine $n$ when the upper tail has good agreement with the upper 5% tail from a standard normal variable.


9.3.5. Repeat the same Exercise 9.3.4 when the population is Bernoulli with the probability of success (1) $p = \frac{1}{2}$, symmetric case; (2) $p = 0.2$, non-symmetric case.


9.4 Collection of dependent variables 


So far we considered only collections of independent random variables. But practical 
examples of dependent variables are plenty. General stochastic processes and time 
series come under this category. Here, we give one example of a sequence of dependent 
variables. 

If a dependent sequence of random variables is considered over time, such as
the price of a staple food, stock market values of shares, the water level in a dam,
or the stock in a grain storage facility, then such a sequence of random variables
over time is called a time series.

There are sequences of variables which are branching in nature. Consider the 
population size in a banana or pineapple plant. Let us consider one banana plant 
to start with. This plant produces one bunch of bananas and when that bunch is cut 
the mother plant dies. But there will be two to three shoots from the bottom. These 
are the next generation plants. If these shoots are planted, then each will produce 
new shoots which will be the next generation plants. If the first generation had three 
shoots and each of these three shoots produced 2,3,3 shoots in the next generation, 
then the second generation size is 2 + 3 + 3 = 8. Thus the population size is available
from a branching process. Such sequences of random variables are called branching 
processes. 

If we check the number of fish in a particular pool area in a river every morning, 
then the numbers are likely to be different on different days. By the next morning, 
some fish may have migrated out of the pool area and some others may have immi- 
grated into the area. If we check the population size in a given community of people 
every 10 years, then the numbers during successive observation periods are likely to 
be different. Some may have died out and some new births may have taken place. Such 
processes are birth and death processes. Special cases are the pure death process and 
pure birth process. 

An ideal hero portrayed in Malayalam movies may be walking home in the follow- 
ing fashion. He comes out of the liquor shop. At every minute, he takes either a step to 
the left or to the right. That step is followed by a random step at the next minute, and 
so on. Such processes are called random walk processes. 

The above are a few examples of dependent sequences of random variables, gen- 
erally known as stochastic processes. 
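
The random walk described above can be sketched in a few lines of Python. This is only an illustration under an added assumption: the text does not give step probabilities, so left and right steps are taken here as equally likely, and the name `random_walk` is hypothetical.

```python
import random

def random_walk(steps, rng):
    """Positions of a simple random walk started at 0 (assumed symmetric)."""
    position, path = 0, [0]
    for _ in range(steps):
        position += rng.choice((-1, 1))   # one step to the left or to the right
        path.append(position)
    return path

rng = random.Random(94)        # fixed seed, only so that the run is reproducible
print(random_walk(20, rng))    # positions after each of 20 one-minute steps
```

Each position depends on the previous one, which is what makes the sequence a collection of dependent random variables.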


10 Sampling distributions 


10.1 Introduction 


In Chapter 9, we have already defined a simple random sample from a given popula- 
tion. The population may be designated by a random variable, its probability/density 
function, its distribution function, its moment generating function (mgf) or its char- 
acteristic function. For ready reference, we will list the definition once again. In this 
chapter, we will deal only with real random variables (not variables defined in the 
complex domain). 


Definition 10.1 (A simple random sample). Let $x_1, x_2, \ldots, x_n$ be a set of independently
and identically distributed (iid) random variables. Let the common probability/density
function be denoted by $f(x)$. Then the collection of random variables $\{x_1,\ldots,x_n\}$ is
called a simple random sample of size $n$ from the population designated by $f(x)$.


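Example 10.1. If $x_1,\ldots,x_4$ are iid random variables following a Poisson distribution
with probability function,

$$f(x) = \begin{cases} \dfrac{\lambda^{x}e^{-\lambda}}{x!}, & \lambda>0,\ x=0,1,\ldots \\ 0, & \text{elsewhere,} \end{cases} \tag{10.1}$$

then compute the joint probability function of the sample values.

Solution 10.1. Here, for example, $\{x_1,\ldots,x_4\}$ is a simple random sample of size n = 4
from this Poisson population with parameter $\lambda$. Then the joint probability function,
denoted by $f(x_1,\ldots,x_4)$, is the product of the marginal probability functions, due to
independence. That is,

$$f(x_1,\ldots,x_4) = \prod_{j=1}^{4}\frac{\lambda^{x_j}e^{-\lambda}}{x_j!}
= \begin{cases} \dfrac{\lambda^{x_1+\cdots+x_4}e^{-4\lambda}}{x_1!\,x_2!\,x_3!\,x_4!}, & x_j=0,1,\ldots,\ j=1,\ldots,4,\ \lambda>0, \\ 0, & \text{elsewhere.} \end{cases} \tag{10.2}$$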


Definition 10.2 (Likelihood function). Let $\{x_1,\ldots,x_n\}$ be a collection of random
variables with the joint probability/density function $f(x_1,\ldots,x_n)$. Then this
$f(x_1,\ldots,x_n)$ at an observed value of $(x_1,\ldots,x_n)$ is called the likelihood function
of the random variables $x_1,\ldots,x_n$.


If $x_1,\ldots,x_n$ are a simple random sample from the population with probability/density
function $f(x)$, then the likelihood function, denoted by $L$, is given by the following
product due to iid nature:

$$L = \prod_{j=1}^{n} f(x_j) \tag{10.3}$$

when $\{x_1,\ldots,x_n\}$ is a set of observed values.

Open Access. © 2018 Arak M. Mathai, Hans J. Haubold, published by De Gruyter. This work is licensed
under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
https://doi.org/10.1515/9783110562545-010


Note 10.1. Here, we use the same small letters to denote mathematical variables, 
random variables and the values assumed by the random variables. The usage will 
be clear from the context. Many authors denote random variables by capital letters 
and the values assumed by them by small letters. This can create double notation 
for the same variable and logical inconsistencies when statements such as Pr{X < x} 
are made, where a big X is smaller than a small x. Besides, small letters are used 
to denote mathematical variables also. Hence we denote mathematical variables as 
well as random variables by small letters so that degenerate random variables will be 
interpreted as mathematical variables. 


In Example 10.1, suppose that the random variables $x_1,x_2,x_3,x_4$ represent the
number of traffic accidents on a stretch of a highway on 4 different independent
occasions. Suppose that on the first occasion, possibly the first day of January, $x_1$ is
observed as 2, on the second occasion, possibly the first day of February, $x_2$ is observed
as 0, on the third occasion, possibly the first day of March, $x_3$ is observed as 1 and
on the 4th occasion $x_4$ is observed as 5. Then the likelihood function in this case is
available from (10.3) by substituting the observations, namely

$$f(x_1=2,x_2=0,x_3=1,x_4=5) = \frac{e^{-4\lambda}\lambda^{2+0+1+5}}{2!\,0!\,1!\,5!} = \frac{\lambda^{8}e^{-4\lambda}}{240}. \tag{10.4}$$
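
As a numerical check of (10.4), the following sketch multiplies the four Poisson probabilities directly and compares the product with the closed form $\lambda^8 e^{-4\lambda}/240$. The function names and the trial values of $\lambda$ are illustrative assumptions, not part of the text.

```python
import math

def poisson_likelihood(lam, observations):
    """Likelihood of iid Poisson(lam) observations: product of the marginal probabilities."""
    value = 1.0
    for x in observations:
        value *= lam ** x * math.exp(-lam) / math.factorial(x)
    return value

def closed_form(lam):
    """Right side of (10.4): lambda^8 * exp(-4 * lambda) / 240."""
    return lam ** 8 * math.exp(-4 * lam) / 240

observations = [2, 0, 1, 5]      # the four accident counts used in the text
for lam in (0.5, 2.0, 3.5):      # trial parameter values, chosen only for illustration
    print(lam, poisson_likelihood(lam, observations), closed_form(lam))
```

For every trial value the two columns should agree up to rounding, which illustrates that the likelihood depends on the parameter only once the observations are substituted.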


Example 10.2. Let $x_1,x_2,x_3$ be independently distributed gamma random variables
with parameters $(\alpha_1,\beta)$, $(\alpha_2,\beta)$, $(\alpha_3,\beta)$, respectively. Evaluate the likelihood function.


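Solution 10.2. Let $L_1$ denote the likelihood function here. Then

$$L_1 = \frac{x_1^{\alpha_1-1}e^{-x_1/\beta}}{\beta^{\alpha_1}\Gamma(\alpha_1)} \times \frac{x_2^{\alpha_2-1}e^{-x_2/\beta}}{\beta^{\alpha_2}\Gamma(\alpha_2)} \times \frac{x_3^{\alpha_3-1}e^{-x_3/\beta}}{\beta^{\alpha_3}\Gamma(\alpha_3)}
= \frac{x_1^{\alpha_1-1}x_2^{\alpha_2-1}x_3^{\alpha_3-1}e^{-\frac{1}{\beta}(x_1+x_2+x_3)}}{\beta^{\alpha_1+\alpha_2+\alpha_3}\Gamma(\alpha_1)\Gamma(\alpha_2)\Gamma(\alpha_3)}$$

at an observed point. Let the observations on the variables be the following: $x_1 = 2$,
$x_2 = 1$, $x_3 = 4$. Then

$$L_1 = \frac{2^{\alpha_1-1}(1)^{\alpha_2-1}4^{\alpha_3-1}e^{-\frac{1}{\beta}(2+1+4)}}{\beta^{\alpha_1+\alpha_2+\alpha_3}\Gamma(\alpha_1)\Gamma(\alpha_2)\Gamma(\alpha_3)}
= \frac{2^{\alpha_1-1}4^{\alpha_3-1}e^{-7/\beta}}{\beta^{\alpha_1+\alpha_2+\alpha_3}\Gamma(\alpha_1)\Gamma(\alpha_2)\Gamma(\alpha_3)}. \tag{10.5}$$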


Note 10.2. Since the joint probability/density function at the observed sample point
is defined as the likelihood function, once the point is substituted the function
becomes a function of the parameters only, which may be observed from (10.4) and
(10.5), and not a function of the variables $x_1,\ldots,x_n$.


Note 10.3. In most of the applications in this and succeeding chapters, we will be 
dealing with simple random samples or iid variables only, coming from a real univari- 
ate (scalar variable case) distribution. Hence, hereafter whenever we refer to a sample 
it will mean a simple random sample or iid variables. 


Definition 10.3 (Sample mean and the sample variance). Let $x_1,\ldots,x_n$ be iid
variables. Then the sample mean, denoted by $\bar{x}$, and the sample variance, denoted
by $s^2$, are defined as follows:

$$\bar{x} = \frac{x_1+\cdots+x_n}{n}, \qquad s^2 = \sum_{j=1}^{n}\frac{(x_j-\bar{x})^2}{n}. \tag{10.6}$$

Note that when $x_1,\ldots,x_n$ are real scalar random variables then $\bar{x}$ and $s^2$ are random
variables having their own distributions. If $\{x_1 = a_1,\ldots,x_n = a_n\}$ is a given set of
observations on $x_1,\ldots,x_n$, then an observed value of $\bar{x}$ is $\bar{a} = \frac{a_1+\cdots+a_n}{n}$ and that of $s^2$ is
$\sum_{j=1}^{n}\frac{(a_j-\bar{a})^2}{n}$. For example, if $n = 2$, $a_1 = 1$, $a_2 = 4$ then the observed value of the sample
mean is $\bar{a} = 2.5$ and the observed value of $s^2$ is $\frac{1}{2}[(1-\frac{5}{2})^2+(4-\frac{5}{2})^2] = 2.25$. In general,
$\bar{x}$ and $s^2$ are random variables, and not numbers.
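
The small numerical example with observations 1 and 4 can be reproduced as follows. The helper names are illustrative; the only point to note is that the divisor is n, as in (10.6), rather than n - 1.

```python
def sample_mean(values):
    """Observed value of the sample mean."""
    return sum(values) / len(values)

def sample_variance(values):
    """Observed value of s^2 with divisor n, as in (10.6) (not n - 1)."""
    m = sample_mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

observed = [1, 4]
print(sample_mean(observed), sample_variance(observed))  # prints 2.5 2.25
```

In the Python standard library, `statistics.pvariance` uses the same divisor n, whereas `statistics.variance` divides by n - 1.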


Definition 10.4 (A statistic). Let the real scalar random variables $x_1,\ldots,x_n$ be a
sample of size n coming from some population (need not be iid but usually we have
iid variables). Any observable function $T = T(x_1,\ldots,x_n)$ of the sample values, having
its own distribution, is called a statistic. [The plural of the term “statistic” is
“statistics”, which is different from the subject matter Statistics. This is yet another
unfortunate technical term in Statistics.] For example, the following are statistics:

$$T_1 = x_1+\cdots+x_n, \qquad T_2 = a_1x_1+\cdots+a_nx_n,$$

where $a_1,\ldots,a_n$ are known constants;

$$T_3 = x_1^2+\cdots+x_n^2, \qquad T_4 = (x_1-2)^2+\cdots+(x_n-2)^2$$


are statistics, and one can construct many such statistics on a given sample. Students
usually have the following doubts: Suppose that we consider functions of the type

$$u_1 = (x_1-\theta)+\cdots+(x_n-\theta); \qquad u_2 = \frac{(x_1-\theta_1)^2+\cdots+(x_n-\theta_1)^2}{\theta_2^2},$$

where $\theta$, $\theta_1$ and $\theta_2$ are some unknown parameters. Are $u_1$ and $u_2$ statistics? If the
distributions of $u_1$ and $u_2$ are free of $\theta,\theta_1,\theta_2$, will $u_1$ and $u_2$ be statistics? The answer
is “no”. As long as unknown parameters such as $\theta,\theta_1,\theta_2$ are present, the functions
are not observable, and hence not statistics. If functions of sample values and
some unknown parameters are such that their distributions are free of all parameters,
then such quantities are called “pivotal” quantities and their uses will be
discussed in the chapter on confidence intervals. Hence the most important basic
property for a statistic is its observability.


Definition 10.5 (Sampling distributions). The distribution of a statistic is known
as the sampling distribution of that statistic, such as the sampling distribution of
the sample mean $\bar{x}$, the sampling distribution of the sample variance $s^2$, etc.


Observe that the phrase “distribution” is used here in the sense that we have identified
a random variable by its probability/density function or its distribution function,
etc. It is unfortunate that there are too many similar sounding technical terms
used in statistical literature, such as “a distribution” (means that a variable
is identified, such as normal distribution, gamma distribution, etc.), “a distribution
function” (means the cumulative probability/density function), and “sampling
distribution” (means a statistic is identified by its density or probability or distribution
function). Also “probability function” is used for discrete and mixed cases only but some
authors use it for all cases. Similarly, “density function” is used for continuous cases
only but some authors use it for all cases. Hence there is no unanimous convention in
the use of the terms “probability function” or “density function”. In this book, we will
use “probability function” for discrete and mixed cases and “density function” for the
continuous case.


10.2 Sampling distributions 


A major part of statistical inference in this module is concerned with Gaussian or nor- 
mal populations, and hence sampling distributions, when the sample comes from a 
normal population, are very important here. But we will also consider sampling dis- 
tributions when the sample comes from other populations as well. 


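Example 10.3. Consider a real gamma population with the density function:

$$f(x) = \begin{cases} \dfrac{x^{\alpha-1}e^{-x/\beta}}{\beta^{\alpha}\Gamma(\alpha)}, & x\ge 0,\ \beta>0,\ \alpha>0 \\ 0, & \text{elsewhere.} \end{cases} \tag{10.7}$$

Evaluate the density functions of (1) $u_1 = x_1+\cdots+x_n$; (2) $u_2 = \bar{x} = \frac{u_1}{n}$; (3) $u_3 = \bar{x}-\alpha\beta$;
(4) $u_4 = \frac{\sqrt{n}\,u_3}{\beta\sqrt{\alpha}}$; (5) $u_5 = \lim_{n\to\infty} u_4$.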


Solution 10.3. It is easier to solve the problems with the help of the mgf of a gamma
random variable or with the Laplace transform of a gamma density. The mgf of the
random variable x, or of the density function $f(x) = \frac{d}{dx}F(x)$ where $F(x)$ is the distribution
function, is denoted by $M_x(t)$ and is the following:

$$M_x(t) = E[e^{tx}] = \int_{-\infty}^{\infty} e^{tx}f(x)\,dx = \int_{-\infty}^{\infty} e^{tx}\,dF(x)$$

where t is a parameter and E denotes the expected value. It is defined by the above
integral when the integral exists. [Replace integrals by sums in the discrete case.] Hence
the $M_x(t)$ for the gamma density in (10.7) is the following:

$$M_x(t) = \int_{0}^{\infty} e^{tx}\,\frac{x^{\alpha-1}e^{-x/\beta}}{\beta^{\alpha}\Gamma(\alpha)}\,dx = (1-\beta t)^{-\alpha} \quad\text{for } 1-\beta t>0. \tag{10.8}$$

Observe that the integral is not convergent if $1-\beta t\le 0$.
(1) If $x_1,\ldots,x_n$ are iid variables with the mgf $M_x(t)$, then the sum has the mgf
$[M_x(t)]^n$ due to independence and identical distribution. Hence

$$M_{u_1}(t) = (1-\beta t)^{-n\alpha}, \quad 1-\beta t>0 \tag{10.9}$$

where $u_1 = x_1+\cdots+x_n$. Since the mgf is unique, by examining (10.9) we see that $u_1$ is
a gamma variable with the parameters $(n\alpha,\beta)$.

(2) Since the mgf of $au_1$ is the mgf of $u_1$ with t replaced by at, for $u_2 = \bar{x} = \frac{u_1}{n}$ we
have

$$M_{u_2}(t) = M_{u_1}\Big(\frac{t}{n}\Big) = \Big(1-\frac{\beta t}{n}\Big)^{-n\alpha}, \quad 1-\frac{\beta t}{n}>0. \tag{10.10}$$

This shows that $u_2$ is gamma distributed with the parameters $(n\alpha,\frac{\beta}{n})$ for each n =
1,2,.... One interesting property is obvious from (10.10). When $n\to\infty$,

$$\lim_{n\to\infty}\Big(1-\frac{\beta t}{n}\Big)^{-n\alpha} = e^{\alpha\beta t} \tag{10.11}$$

which is the mgf of a degenerate random variable, taking the value $\alpha\beta$ with probability 1.
In other words, as n becomes larger and larger the curve becomes more and
more peaked around the line $x = \alpha\beta$, which is the mean value of the gamma variable
with parameters $(\alpha,\beta)$, and then eventually the whole probability mass will be
concentrated at the point $x = \alpha\beta$. The behavior of the graphs of the density of $\bar{x}$ for various
values of the sample size n is shown in Figure 10.1.


Remark 10.1. Some students may have the wrong notion that, since the standardized
sample mean goes to a standard normal as the sample size goes to infinity, the sample
mean itself has an approximate normal distribution for large n. This is incorrect, which
may be seen from Figure 10.1. Even $\bar{x}-E[\bar{x}]$ does not approximate to a normal; only
the standardized sample mean will approximate to a standard normal when n is large.
In other words,

$$\frac{\bar{x}-E[\bar{x}]}{\sqrt{\operatorname{Var}(\bar{x})}} \approx N(0,1) \ \not\Rightarrow\ \bar{x}\approx N(\mu_0,\sigma_0^2) \ \text{ or } \ \bar{x}-E(\bar{x})\approx N(0,\sigma_0^2)$$

for some $\mu_0$ and $\sigma_0^2$.

Figure 10.1: The density of $\bar{x}$ when the population is gamma.


Result 10.1. When the population is a gamma population with the parameters $(\alpha,\beta)$,
the sample mean $\bar{x}$ goes to $\alpha\beta$ with probability 1 when the sample size n goes to infinity,
or $\bar{x}$ converges to $E(x) = \alpha\beta$ with probability 1 when n goes to infinity.
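
The concentration of the sample mean around $\alpha\beta$ can be seen in a small simulation. This is a rough sketch, not a proof: the parameter values, the seed and the number of replications are arbitrary choices made here, and `random.gammavariate(alpha, beta)` from the Python standard library is used with `beta` as the scale parameter.

```python
import random
import statistics

def sample_means(alpha, beta, n, replications, rng):
    """Simulate `replications` sample means, each from n iid gamma(alpha, beta) draws."""
    return [
        sum(rng.gammavariate(alpha, beta) for _ in range(n)) / n
        for _ in range(replications)
    ]

rng = random.Random(12345)      # fixed seed, only for reproducibility
alpha, beta = 2.0, 3.0          # illustrative values; E(x) = alpha * beta = 6
for n in (1, 10, 100, 1000):
    means = sample_means(alpha, beta, n, 500, rng)
    # average of the simulated means and their spread; theory gives 6 and 18 / n
    print(n, round(statistics.fmean(means), 3), round(statistics.pvariance(means), 4))
```

The average stays near 6 while the variance of the simulated means shrinks roughly like $\alpha\beta^2/n$, in line with the degenerate limit in the result above.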


(3) If a variable x is relocated at the point x = a, then the mgf, by definition, is the
following:

$$M_{x-a}(t) = E[e^{t(x-a)}] = e^{-ta}M_x(t).$$

If the variable x is relocated and re-scaled, that is, if y = ax + b then

$$M_y(t) = e^{tb}M_x(at). \tag{10.12}$$

Therefore,

$$M_{u_3}(t) = e^{-\alpha\beta t}M_{u_2}(t) = e^{-\alpha\beta t}\Big(1-\frac{\beta t}{n}\Big)^{-n\alpha}, \quad 1-\frac{\beta t}{n}>0 \tag{10.13}$$

which shows that $u_3$ is a relocated gamma random variable with parameters $(n\alpha,\frac{\beta}{n})$
and re-location parameter $\alpha\beta$, or with the density, denoted by $f_{u_3}(u_3)$,

$$f_{u_3}(u_3) = \frac{(u_3+\alpha\beta)^{n\alpha-1}e^{-\frac{n}{\beta}(u_3+\alpha\beta)}}{(\beta/n)^{n\alpha}\Gamma(n\alpha)} \tag{10.14}$$

for $u_3\ge -\alpha\beta$, $\alpha>0$, $\beta>0$, $n = 1,2,\ldots$


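(4)

$$u_4 = \frac{\sqrt{n}(\bar{x}-\alpha\beta)}{\beta\sqrt{\alpha}} = \frac{\sqrt{n}}{\beta\sqrt{\alpha}}\,\bar{x} - \sqrt{n\alpha}.$$

Therefore, the mgf of $u_4$ is given by

$$M_{u_4}(t) = e^{-t\sqrt{n\alpha}}\Big(1-\frac{t}{\sqrt{n\alpha}}\Big)^{-n\alpha}, \quad 1-\frac{t}{\sqrt{n\alpha}}>0 \tag{10.15}$$

which shows that $u_4$ is a relocated gamma random variable with parameters $(n\alpha,\frac{1}{\sqrt{n\alpha}})$
and the relocation parameter is $\sqrt{n\alpha}$, for each n = 1,2,....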

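(5) $u_5$ is the limiting form of $u_4$ when n goes to infinity. A convenient way of taking
this limit is to take the natural logarithm of the right side of (10.15), then
expand and then take the limit. That is,

$$\ln M_{u_4}(t) = -t\sqrt{n\alpha} - n\alpha\ln\Big(1-\frac{t}{\sqrt{n\alpha}}\Big)
= -t\sqrt{n\alpha} + n\alpha\Big[\frac{t}{\sqrt{n\alpha}}+\frac{t^2}{2n\alpha}+\frac{t^3}{3(n\alpha)^{3/2}}+\cdots\Big]
= \frac{t^2}{2} + O\Big(\frac{1}{\sqrt{n}}\Big) \to \frac{t^2}{2} \ \text{ as } n\to\infty. \tag{10.16}$$

Since all terms containing $t^3$ and higher powers will contain $\sqrt{n}$ and its powers in the
denominator, all such terms will go to zero when $n\to\infty$. Hence

$$\lim_{n\to\infty}\ln M_{u_4}(t) = \frac{t^2}{2} \ \Rightarrow\ M_{u_5}(t) = e^{\frac{t^2}{2}} \tag{10.17}$$

which is the mgf of a standard normal variable. Hence $u_5$ has a standard normal
distribution with the density

$$f_{u_5}(u_5) = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{u_5^2}{2}}, \quad -\infty<u_5<\infty.$$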
Since a chi-square random variable is a particular case of a gamma random variable
with $\alpha = \frac{\nu}{2}$ and $\beta = 2$, $\nu = 1,2,\ldots$ ($\nu$ is the Greek letter nu), if $y\sim\chi^2_{\nu}$, or if y is a
chi-square with $\nu$ degrees of freedom, then $E(y) = \alpha\beta = (\frac{\nu}{2})\times 2 = \nu$ and $\operatorname{Var}(y) = \alpha\beta^2 =
(\frac{\nu}{2})(4) = 2\nu$. Hence for a sample of size n from a chi-square distribution with $\nu$ degrees
of freedom, the sample sum $u_1 = x_1+\cdots+x_n$ is a gamma with parameters $n\alpha = \frac{n\nu}{2}$ and
$\beta = 2$, or $u_1$ is a chi-square with $n\nu$ degrees of freedom, or $u_1\sim\chi^2_{n\nu}$. If $u_2 = \bar{x}$, then $u_2$
is a gamma with the parameters $\alpha = \frac{n\nu}{2}$ and $\beta = \frac{2}{n}$. Therefore, we have the following
result.


Result 10.2. When the population is a chi-square population with $\nu$ degrees of freedom,
then as $n\to\infty$

$$\frac{\sqrt{n}(\bar{x}-\nu)}{\sqrt{2\nu}} \to z\sim N(0,1) \quad\text{as } n\to\infty \tag{10.18}$$

where $N(0,1)$ denotes a standard normal population.


An exponential variable is a gamma variable with $\alpha = 1$ and $\beta = \theta>0$. If $y_1$
denotes an exponential variable with parameter $\theta$, then $E(y_1) = \theta$ and $\operatorname{Var}(y_1) = \theta^2$. If
a sample of size n comes from an exponential population with parameter $\theta$, then we
have the following result.

Result 10.3. For a sample of size n from an exponential population with parameter $\theta$,

$$\frac{\sqrt{n}(\bar{x}-\theta)}{\theta} \to z\sim N(0,1), \quad\text{as } n\to\infty. \tag{10.19}$$

Note that in this case the sample sum $u_1 = x_1+\cdots+x_n$ is a gamma random variable
with parameters $\alpha = n$ and $\beta = \theta$. The sample mean $u_2 = \bar{x}$ is a gamma with parameters
$\alpha = n$ and $\beta = \frac{\theta}{n}$ for each n = 1,2,.... The above are illustrations of the central limit
theorem when the population is a gamma.
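
Result 10.3 can be explored numerically along the lines of Exercise 9.3.2. The sketch below simulates the standardized sample mean for an exponential population and reports how often it exceeds 1.645, the approximate upper 5% point of a standard normal. The value $\theta = 5$, the seed and the replication counts are assumptions made for illustration; `random.expovariate` takes the rate $1/\theta$, not the mean.

```python
import math
import random

def standardized_mean(theta, n, rng):
    """One draw of sqrt(n) * (xbar - theta) / theta for an exponential(theta) sample."""
    xbar = sum(rng.expovariate(1.0 / theta) for _ in range(n)) / n
    return math.sqrt(n) * (xbar - theta) / theta

def upper_tail_fraction(theta, n, replications, rng, cutoff=1.645):
    """Fraction of simulated standardized means that exceed the cutoff."""
    draws = [standardized_mean(theta, n, rng) for _ in range(replications)]
    return sum(d > cutoff for d in draws) / replications

rng = random.Random(2018)       # fixed seed, only for reproducibility
for n in (5, 30, 200):
    print(n, upper_tail_fraction(5.0, n, 4000, rng))   # compare with 0.05
```

Because the exponential density is skewed to the right, the simulated tail fraction for small n would typically be somewhat above 0.05 and should move toward it as n grows.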


Example 10.4. Consider a simple random sample of size n, $\{x_1,\ldots,x_n\}$, from a
Bernoulli population with probability function

$$f(x) = p^{x}q^{1-x}, \quad x = 0,1,\ q = 1-p,\ 0<p<1$$

and zero elsewhere. [Note that for p = 0 or p = 1 we have a deterministic situation or a
degenerate random variable.] Evaluate the probability functions of (1) $u_1 = x_1+\cdots+x_n$;
(2) $u_2 = \bar{x} = \frac{u_1}{n}$; (3) $u_3 = \bar{x}-p$; (4) $u_4 = \frac{\sqrt{n}(\bar{x}-p)}{\sqrt{pq}}$; (5) $u_5 = \lim_{n\to\infty}u_4$.


Solution 10.4. Note that the mean value and the variance of a Bernoulli variable x are
given by the following: $E(x) = p$, $\operatorname{Var}(x) = pq$ and $\operatorname{Var}(\bar{x}) = \frac{pq}{n}$. The mgf of the Bernoulli
variable x is given by

$$M_x(t) = \sum_{x=0}^{1} e^{tx}p^{x}q^{1-x} = q+pe^{t}, \quad q = 1-p,\ 0<p<1.$$

(1) Therefore, the mgf of $u_1 = x_1+\cdots+x_n$ is available as

$$M_{u_1}(t) = (q+pe^{t})^{n}. \tag{10.20}$$

But this is the mgf of a binomial random variable, and hence $u_1$ is a binomial random
variable, with the probability function,

$$f_{u_1}(u_1) = \binom{n}{u_1}p^{u_1}q^{n-u_1}, \quad u_1 = 0,1,\ldots,n;\ q = 1-p,\ 0<p<1$$

and zero otherwise.
(2) By using the earlier procedure,

$$M_{u_2}(t) = (q+pe^{\frac{t}{n}})^{n}.$$

The probability function in this case is

$$f_{u_2}(u_2) = \binom{n}{nu_2}p^{nu_2}q^{n-nu_2}, \quad u_2 = 0,\frac{1}{n},\ldots,1$$

and zero elsewhere.
(3) By using the earlier procedure

$$M_{u_3}(t) = e^{-pt}M_{u_2}(t) = e^{-pt}(q+pe^{\frac{t}{n}})^{n}.$$

This gives a relocated form of the probability function in Case (2):

$$\Pr\{u_1 = y\} = \Pr\Big\{u_2 = \frac{y}{n}\Big\} = \Pr\Big\{u_3 = \frac{y}{n}-p\Big\}$$

for y = 0,1,...,n.
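(4)

$$u_4 = \frac{\sqrt{n}(\bar{x}-p)}{\sqrt{pq}} = \frac{\sqrt{n}\,\bar{x}}{\sqrt{pq}}-\frac{p\sqrt{n}}{\sqrt{pq}} = \frac{\sum_{j=1}^{n}x_j}{\sqrt{npq}}-\frac{np}{\sqrt{npq}}.$$

Then the mgf is given by

$$M_{u_4}(t) = e^{-\frac{np}{\sqrt{npq}}t}\big(q+pe^{\frac{t}{\sqrt{npq}}}\big)^{n}.$$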


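(5) Consider the natural logarithm on both sides, then expand the exponential
function:

$$\ln M_{u_4}(t) = -\frac{np}{\sqrt{npq}}\,t + n\ln\Big[q+p\Big(1+\frac{t}{\sqrt{npq}}+\frac{t^2}{2!(npq)}+O\Big(\frac{1}{n^{3/2}}\Big)\Big)\Big]
= -\frac{np}{\sqrt{npq}}\,t + n\ln[1+\epsilon]$$

where $q+p = 1$ and $\epsilon = \frac{pt}{\sqrt{npq}}+\frac{pt^2}{2!(npq)}+O(\frac{1}{n^{3/2}})$. But

$$\ln(1+\epsilon) = \epsilon-\frac{\epsilon^2}{2}+\frac{\epsilon^3}{3}-\cdots \quad\text{for } |\epsilon|<1.$$

Without loss of generality, we can assume that $|\epsilon|<1$ for large n. Therefore,

$$\ln M_{u_4}(t) = -\frac{np}{\sqrt{npq}}\,t + n\epsilon - \frac{n\epsilon^2}{2} + \cdots$$

Now, collecting the coefficients of t on the right we see that the sum is zero. The coefficient
of $t^2$ on the right is $\frac{p}{2pq}-\frac{p^2}{2pq} = \frac{1}{2}$ and the remaining terms are of the order $O(\frac{1}{\sqrt{n}})\to 0$ as $n\to\infty$.
Hence, when $n\to\infty$, we have

$$\ln M_{u_5}(t) = \lim_{n\to\infty}\ln M_{u_4}(t) = \frac{t^2}{2} \ \Rightarrow\ M_{u_5}(t) = e^{\frac{t^2}{2}}$$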


which is the mgf of a standard normal variable. Hence we have the following result: 


Result 10.4. When the sample of size n comes from a Bernoulli population $p^{x}q^{1-x}$,
$x = 0,1$, $q = 1-p$, $0<p<1$, then the standardized sample mean, which is equivalent to
the standardized binomial variable, goes to a standard normal variable when n goes
to infinity. That is,

$$u_4 = \frac{\bar{x}-E(\bar{x})}{\sqrt{\operatorname{Var}(\bar{x})}} = \frac{\sum_{j=1}^{n}x_j-np}{\sqrt{npq}} = \frac{x-np}{\sqrt{npq}} \to z\sim N(0,1)$$

as $n\to\infty$, where x is the binomial random variable.


Thus, in the binomial case the standardized variable itself goes to the standard 
normal variable when the number of Bernoulli trials goes to infinity. This result is also 
consistent with the central limit theorem where the population is the Bernoulli popu- 
lation. 
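
The quality of the normal approximation to the standardized binomial can be checked exactly, without simulation. The sketch below sums binomial probabilities over the values whose standardized distance from np is at most 1.96 and compares the total with 0.95. The function names and the choice of n and p are illustrative assumptions; probabilities are computed through `math.lgamma` to avoid overflow for large n.

```python
import math

def binomial_pmf(k, n, p):
    """Binomial probability, computed on the log scale to avoid overflow for large n."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1.0 - p))
    return math.exp(log_pmf)

def central_probability(n, p, cutoff=1.96):
    """Exact Pr{|x - np| / sqrt(npq) <= cutoff} for a binomial x."""
    sd = math.sqrt(n * p * (1.0 - p))
    return sum(binomial_pmf(k, n, p)
               for k in range(n + 1)
               if abs(k - n * p) / sd <= cutoff)

for p in (0.5, 0.2):            # symmetric and non-symmetric cases
    for n in (10, 50, 1000):
        print(p, n, round(central_probability(n, p), 4))   # compare with 0.95
```

Because the binomial is discrete, the exact probability jumps as n changes and need not approach 0.95 monotonically.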


Note 10.4. If x is a real scalar random variable with $E(x) = \mu$ and $\operatorname{Var}(x) = \sigma^2$, then
$y = \frac{x-\mu}{\sigma}$ is called the standardized x, with $E(y) = 0$ and $\operatorname{Var}(y) = 1$.


Note 10.5. When a simple random sample of size n comes from a Bernoulli population,
then the likelihood function L is given by

$$L = \prod_{j=1}^{n}p^{x_j}q^{1-x_j} = p^{x}q^{n-x} \tag{10.21}$$

where $x = x_1+\cdots+x_n$ is a binomial random variable at the observed sample point. Observe that the
number of combinations $\binom{n}{x}$, appearing in the binomial probability function, does not
enter into the likelihood function in (10.21).


Example 10.5. Let $x_1,\ldots,x_n$ be iid real scalar random variables following a normal
distribution $N(\mu,\sigma^2)$. Compute the distributions of (1) $u_1 = a_1x_1+\cdots+a_nx_n$ where
$a_1,\ldots,a_n$ are constants; (2) $u_2 = \bar{x}$; (3) $u_3 = \bar{x}-\mu$; (4) $u_4 = \frac{\sqrt{n}}{\sigma}(\bar{x}-\mu)$.


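Solution 10.5. The Gaussian or normal density function for a real scalar random
variable x is given by

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}, \quad -\infty<x<\infty,\ \sigma>0,\ -\infty<\mu<\infty$$

and the mgf of x is given by

$$M_x(t) = E[e^{tx}] = \int_{-\infty}^{\infty}e^{tx}f(x)\,dx = e^{t\mu+\frac{1}{2}t^2\sigma^2}. \tag{10.22}$$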


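(1) In order to compute the distribution of $u_1$, we will compute the mgf of $u_1$ and
then try to identify the distribution from this mgf:

$$M_{u_1}(t) = E[e^{t(a_1x_1+\cdots+a_nx_n)}] = \prod_{j=1}^{n}E[e^{ta_jx_j}]
= \prod_{j=1}^{n}e^{t\mu a_j+\frac{1}{2}t^2\sigma^2a_j^2} = e^{t\mu\left(\sum_{j=1}^{n}a_j\right)+\frac{t^2\sigma^2}{2}\left(\sum_{j=1}^{n}a_j^2\right)}$$

due to $x_1,\ldots,x_n$ being iid normal variables. But this mgf is that of a normal variable
with mean value $\mu\sum_{j=1}^{n}a_j$ and variance $\sigma^2\sum_{j=1}^{n}a_j^2$. Therefore,

$$u_1 \sim N\Big(\mu\sum_{j=1}^{n}a_j,\ \sigma^2\sum_{j=1}^{n}a_j^2\Big)$$

where “~” indicates “distributed as”.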


Note 10.6. If $x_j\sim N(\mu_j,\sigma_j^2)$, $j = 1,\ldots,n$ are independently distributed and if $u_1 = a_1x_1+\cdots+
a_nx_n$, then from the above procedure, it is evident that

$$u_1 \sim N\Big(\sum_{j=1}^{n}a_j\mu_j,\ \sum_{j=1}^{n}a_j^2\sigma_j^2\Big). \tag{10.23}$$


Result 10.5. If $x_j\sim N(\mu_j,\sigma_j^2)$, $j = 1,\ldots,k$ are mutually independently distributed and
if $u = a_1x_1+\cdots+a_kx_k$ is a linear function, where $a_1,\ldots,a_k$ are constants, then

$$u \sim N\Big(\sum_{j=1}^{k}a_j\mu_j,\ \sum_{j=1}^{k}a_j^2\sigma_j^2\Big). \tag{10.24}$$
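(2) Putting $k = n$, $a_1 = \cdots = a_n = \frac{1}{n}$, $\mu_1 = \cdots = \mu_n = \mu$, $\sigma_1^2 = \cdots = \sigma_n^2 = \sigma^2$ in (10.24) we
have

$$u_2 = \bar{x} \sim N\Big(\mu,\frac{\sigma^2}{n}\Big), \quad n = 1,2,\ldots;\qquad M_{u_2}(t) = e^{t\mu+\frac{1}{2}\frac{t^2\sigma^2}{n}}.$$

Thus, for each n, $\bar{x}$ is again normally distributed with mean value $\mu$ and variance $\sigma^2/n$.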
(3)

$$M_{u_3}(t) = e^{-\mu t}M_{u_2}(t) = e^{\frac{1}{2}\frac{t^2\sigma^2}{n}}.$$

This means that $u_3 = \bar{x}-\mu \sim N(0,\frac{\sigma^2}{n})$ for every n = 1,2,...


(4)

$$M_{u_4}(t) = M_{u_3}\Big(\frac{\sqrt{n}}{\sigma}\,t\Big) = e^{\frac{t^2}{2}}.$$

Therefore, $u_4\sim N(0,1)$, n = 1,2,.... Thus, the standardized sample mean
$u_4 = \frac{\sqrt{n}(\bar{x}-\mu)}{\sigma}$ is exactly standard normal for each n = 1,2,..., including $n\to\infty$, when the
sample comes from a normal population. When the sample comes from other populations
with finite variance, we have seen that the standardized sample mean goes to a
standard normal variable when the sample size goes to infinity. In Figure 10.2,
(a) is the density of the sample mean $\bar{x}$, (b) is the density of $\bar{x}-\mu$ and (c) is
the density of the standardized variable when the population is normal.

Figure 10.2: Density curves, panels (a), (b), (c), for the sample mean when the population is Gaussian.


Example 10.6. Let $z_1,\ldots,z_n$ be iid real scalar random variables following a standard
normal distribution $N(0,1)$. Compute the distributions of (1) $u_1 = z_1^2$; (2) $u_2 = z_1^2+\cdots+z_n^2$.
Solution 10.6. (1) This was already done in Module 6. For the sake of completeness,
we will repeat it here by using transformation of variables. Another method, using
the distribution function, is given in the exercises. Here, $z_1$ is standard normal and its
density is given by

$$f_1(z_1) = \frac{e^{-\frac{z_1^2}{2}}}{\sqrt{2\pi}}, \quad -\infty<z_1<\infty.$$

But the transformation $u_1 = z_1^2$ is not one to one since $z_1$ can take negative values also.
But in each interval $(-\infty<z_1<0)$ and $(0<z_1<\infty)$, the transformation is one to one.
Consider the interval $0\le z_1<\infty$. Then

$$u_1 = z_1^2 \ \Rightarrow\ z_1 = u_1^{\frac{1}{2}} \ \Rightarrow\ dz_1 = \frac{1}{2}u_1^{\frac{1}{2}-1}\,du_1$$

and that part of the density of $u_1$, denoted by $g_1^{(1)}(u_1)$, is

$$g_1^{(1)}(u_1) = \frac{1}{2}\,\frac{1}{\sqrt{2\pi}}\,u_1^{\frac{1}{2}-1}e^{-\frac{u_1}{2}}.$$

But in the interval $(-\infty,0)$ also the function $f_1(z_1)$ is the same, being an even function.
Hence the density of $u_1$, denoted by $g_1(u_1)$, is given by

$$g_1(u_1) = \begin{cases} \dfrac{1}{2^{\frac{1}{2}}\Gamma(\frac{1}{2})}\,u_1^{\frac{1}{2}-1}e^{-\frac{u_1}{2}}, & 0\le u_1<\infty \\ 0, & \text{elsewhere} \end{cases}$$

which is a gamma density with $\alpha = \frac{1}{2}$ and $\beta = 2$, or it is a chi-square density with one
degree of freedom, or

$$z_1\sim N(0,1), \quad u_1 = z_1^2\sim\chi_1^2; \qquad M_{u_1}(t) = (1-2t)^{-\frac{1}{2}}, \quad 1-2t>0. \tag{10.25}$$


(2) Since the mgf of a sum of independent variables is the product of the individual
mgf, we have

$$M_{u_2}(t) = \prod_{j=1}^{n}M_{z_j^2}(t) = [M_{u_1}(t)]^{n} = (1-2t)^{-\frac{n}{2}}, \quad 1-2t>0.$$

Therefore, we have the following result.

Result 10.6. For $z_1,\ldots,z_n$ iid with common distribution $N(0,1)$,

$$z_j^2\sim\chi_1^2 \quad\text{and}\quad u_2 = \sum_{j=1}^{n}z_j^2\sim\chi_n^2. \tag{10.26}$$
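
Result 10.6 can be checked roughly by simulation, using the fact stated earlier that a chi-square with $\nu$ degrees of freedom has mean $\nu$ and variance $2\nu$. The sketch below is only a sanity check of the first two moments; n = 4, the seed and the number of replications are arbitrary choices made here.

```python
import random
import statistics

def sum_of_squares(n, rng):
    """One draw of z_1^2 + ... + z_n^2 with z_j iid N(0, 1)."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))

rng = random.Random(1026)       # fixed seed, only for reproducibility
n = 4
draws = [sum_of_squares(n, rng) for _ in range(20000)]
# simulated mean and variance, to be compared with n and 2n
print(statistics.fmean(draws), statistics.pvariance(draws), n, 2 * n)
```

Matching moments does not by itself establish the chi-square form; that follows from the mgf argument above.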


Exercises 10.2 


10.2.1. If $x_1,\ldots,x_n$ are iid from a uniform population over [0,1], evaluate the density
of $x_1+\cdots+x_n$ for (1) n = 2; (2) n = 3. What is the distribution in the general case?


10.2.2. If $x_1,\ldots,x_n$ are iid Poisson distributed with parameter $\lambda$, then (1) derive the
probability function of $u_1 = x_1+\cdots+x_n$; (2) write down the probability function of $\bar{x}$.


10.2.3. If $x_1,\ldots,x_n$ are iid type-1 beta distributed with parameters $(\alpha,\beta)$, then compute
the density of (1) $u_1 = x_1+x_2$; (2) $u_2 = x_1+x_2+x_3$.

10.2.4. Repeat Exercise 10.2.3 if the population is type-2 beta with the parameters
$(\alpha,\beta)$.

10.2.5. State the central limit theorem explicitly if the sample comes from (1) type-1 
beta population; (2) type-2 beta population. 

10.2.6. Let $x_1,\ldots,x_n$ be iid Bernoulli distributed with parameter p, $0<p<1$. Let

$$u_1 = x_1+\cdots+x_n-np; \qquad u_2 = \frac{u_1}{\sqrt{np(1-p)}}.$$

Using a computer, or otherwise, evaluate $\gamma$ so that $\Pr\{|u_2|\ge\gamma\} = 0.05$ for n = 10, 20, 30,
50 and compute $n_0$ such that for all $n\ge n_0$, $\gamma$ approximates to the corresponding standard
normal value 1.96.


10.2.7. Repeat Exercise 10.2.6 with u, of Example 10.4 and make comments about bi- 
nomial approximation to a standard normal variable. 


10.2.8. Let x be a gamma random variable with parameters $(n,\beta)$, n = 1,2,.... Compute
the mgf of (1) $u_1 = \frac{x}{\beta}$; (2) $u_2 = \frac{x}{\beta}-n$; (3) $u_3 = \frac{\frac{x}{\beta}-n}{\sqrt{n}}$; (4) show that $u_3$ goes to a standard
normal variable when $n\to\infty$.


10.2.9. Interpret (4) of Exercise 10.2.8 in terms of the central limit theorem. Which is 
the population and which is the sample? 


10.2.10. Is there any connection between central limit theorem and infinite divisibil- 
ity of real random variables? Explain. 


10.2.11. Let $z\sim N(0,1)$. Let $y = z^2$. Compute the following probabilities: (1) $\Pr\{y\le u\} =
\Pr\{z^2\le u\} = \Pr\{|z|\le\sqrt{u}\}$; (2) by using (1) derive the distribution function of y and
thereby the density of y; (3) show that $y\sim\chi_1^2$.


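10.2.12. Let $x_1,\ldots,x_n$ be independently distributed with $E(x_j) = \mu_j$, $\operatorname{Var}(x_j) = \sigma_j^2<\infty$,
$j = 1,\ldots,n$. Let $\bar{x} = \frac{x_1+\cdots+x_n}{n}$. Consider the standardized $\bar{x}$,

$$u = \frac{\bar{x}-E(\bar{x})}{\sqrt{\operatorname{Var}(\bar{x})}} = \frac{\sum_{j=1}^{n}(x_j-\mu_j)}{\sqrt{\sigma_1^2+\cdots+\sigma_n^2}}.$$

Assuming the existence of the mgf of $x_j$, $j = 1,\ldots,n$, work out a condition on $\sigma_1^2+\cdots+\sigma_n^2$
so that $u\to z\sim N(0,1)$ as $n\to\infty$.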


10.2.13. Generalize Exercise 10.2.12 when u is the standardized $v = a_1x_1+\cdots+a_nx_n$,
when $X' = (x_1,\ldots,x_n)$ has a joint distribution with covariance matrix $\Sigma = (\sigma_{ij})$ with
$\|\Sigma\|<\infty$ and $a' = (a_1,\ldots,a_n)$ is a fixed vector of constants and $\|(\cdot)\|$ denotes a norm
of $(\cdot)$.


10.3 Sampling distributions when the population is normal 


Here, we will investigate some sampling distributions when the population is normal.
We have seen several results in this category already. Let $x\sim N(\mu,\sigma^2)$ be the population
and let $x_1,\ldots,x_n$ be a simple random sample from this population. Then we have seen
that

$$\bar{x} = \frac{x_1+\cdots+x_n}{n} \sim N\Big(\mu,\frac{\sigma^2}{n}\Big).$$

Note that when n becomes larger and larger the variance of $\bar{x}$ becomes smaller
and smaller and finally it goes to zero. In other words, the normal distribution degenerates
to the point at $x = \mu$ with the total probability mass 1 at this point. That is, $\bar{x}$
converges to $\mu$ with probability 1. This, in fact, is a general result, which was stated
as the weak law of large numbers in Chapter 9. We have the following results from the
discussions so far.


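Result 10.7.

$$x_j\sim N(\mu,\sigma^2) \ \Rightarrow\ z_j = \frac{x_j-\mu}{\sigma}\sim N(0,1); \qquad z_j^2 = \Big(\frac{x_j-\mu}{\sigma}\Big)^2\sim\chi_1^2;$$

$$\sum_{j=1}^{n}x_j\sim N(n\mu,n\sigma^2); \qquad \bar{x}\sim N\Big(\mu,\frac{\sigma^2}{n}\Big); \qquad u = \frac{\sqrt{n}}{\sigma}(\bar{x}-\mu)\sim N(0,1);$$

$$u^2 = \frac{n}{\sigma^2}(\bar{x}-\mu)^2\sim\chi_1^2; \qquad \sum_{j=1}^{n}\Big(\frac{x_j-\mu}{\sigma}\Big)^2\sim\chi_n^2.$$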

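Result 10.8. From Result 10.7, we have the following when the population is $N(\mu,\sigma^2)$:

$$E\Big[\frac{n}{\sigma^2}(\bar{x}-\mu)^2\Big] = E[\chi_1^2] = 1; \qquad \operatorname{Var}\Big[\frac{n}{\sigma^2}(\bar{x}-\mu)^2\Big] = \operatorname{Var}(\chi_1^2) = 2;$$

$$E\Big[\sum_{j=1}^{n}\Big(\frac{x_j-\mu}{\sigma}\Big)^2\Big] = E[\chi_n^2] = n; \qquad \operatorname{Var}\Big[\sum_{j=1}^{n}\Big(\frac{x_j-\mu}{\sigma}\Big)^2\Big] = \operatorname{Var}(\chi_n^2) = 2n. \tag{10.27}$$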


Note 10.7. Corresponding properties hold even if $x_1,\ldots,x_n$ are not identically
distributed but independently distributed as $x_j\sim N(\mu_j,\sigma_j^2)$, $j = 1,\ldots,n$.


Result 10.9. For $x_j\sim N(\mu_j,\sigma_j^2)$, $j = 1,\ldots,n$ and independently distributed, we have
the following:

$$x_j-\mu_j\sim N(0,\sigma_j^2); \qquad \frac{x_j-\mu_j}{\sigma_j}\sim N(0,1); \qquad \sum_{j=1}^{n}\Big(\frac{x_j-\mu_j}{\sigma_j}\Big)^2\sim\chi_n^2. \tag{10.28}$$


Example 10.7. Compute the expected value of the sample variance when the sample 
comes from any population with finite variance and compute the distribution of the 
sample variance when the sample comes from a normal population. 


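Solution 10.7. Let $x_1,\ldots,x_n$ be a simple random sample from any population with
mean value $\mu$ and variance $\sigma^2<\infty$. Then

$$E(x_j) = \mu; \quad E(x_j-\mu) = 0; \quad E\Big[\frac{x_j-\mu}{\sigma}\Big] = 0;$$

$$\operatorname{Var}(x_j) = \sigma^2; \quad \operatorname{Var}(x_j-\mu) = \sigma^2; \quad \operatorname{Var}\Big(\frac{x_j-\mu}{\sigma}\Big) = 1;$$

$$E[\bar{x}] = \mu; \quad E[\bar{x}-\mu] = 0; \quad \operatorname{Var}(\bar{x}) = \frac{\sigma^2}{n}.$$

The sample variance can be represented as follows:

$$\begin{aligned}
s^2 &= \frac{1}{n}\sum_{j=1}^{n}(x_j-\bar{x})^2 = \frac{1}{n}\sum_{j=1}^{n}(x_j-\mu+\mu-\bar{x})^2 \\
&= \frac{1}{n}\sum_{j=1}^{n}(x_j-\mu)^2 + (\bar{x}-\mu)^2 + \frac{2}{n}(\mu-\bar{x})\sum_{j=1}^{n}(x_j-\mu) \\
&= \frac{1}{n}\sum_{j=1}^{n}(x_j-\mu)^2 + (\bar{x}-\mu)^2 - 2(\bar{x}-\mu)^2 \\
&= \frac{1}{n}\sum_{j=1}^{n}(x_j-\mu)^2 - (\bar{x}-\mu)^2.
\end{aligned} \tag{10.29}$$

Taking expectations on both sides, we have

$$E(s^2) = \frac{1}{n}\sum_{j=1}^{n}\operatorname{Var}(x_j) - \operatorname{Var}(\bar{x}) = \frac{1}{n}\,n\sigma^2 - \frac{\sigma^2}{n} = \frac{n-1}{n}\,\sigma^2.$$

This shows that

$$E\Big[\sum_{j=1}^{n}\frac{(x_j-\bar{x})^2}{n-1}\Big] = \sigma^2, \tag{10.30}$$

a property called unbiasedness, which will be discussed in the chapter on estimation.
The above result says that $\sum_{j=1}^{n}\frac{(x_j-\bar{x})^2}{n-1} = \frac{n}{n-1}s^2$ is unbiased for the population variance,
whatever be the population, as long as the population variance is finite.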
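
The relation $E(s^2) = \frac{n-1}{n}\sigma^2$ holds for any population with finite variance, which can be illustrated by simulation. The sketch below uses a uniform population over [0,1], whose variance is 1/12; the sample size, seed and number of replications are arbitrary assumptions made for the illustration.

```python
import random
import statistics

def s2(values):
    """Sample variance with divisor n, as defined in (10.6)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

rng = random.Random(1030)       # fixed seed, only for reproducibility
n, replications = 5, 40000
# Uniform(0, 1) population: variance 1/12, deliberately non-normal.
s2_values = [s2([rng.random() for _ in range(n)]) for _ in range(replications)]
mean_s2 = statistics.fmean(s2_values)
print(mean_s2, (n - 1) / n / 12)         # average of s^2 versus (n - 1) / n * sigma^2
print(mean_s2 * n / (n - 1), 1 / 12)     # rescaled average versus sigma^2
```

The first pair of numbers should be close to each other, and so should the second pair, showing that the divisor n - 1 removes the bias.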


If the population is normal, then we have shown that

$$\sum_{j=1}^{n}\frac{(x_j-\mu)^2}{\sigma^2}\sim\chi_n^2 \quad\text{and}\quad \frac{n}{\sigma^2}(\bar{x}-\mu)^2\sim\chi_1^2.$$

But from (10.29), it is evident that

$$\sum_{j=1}^{n}\frac{(x_j-\mu)^2}{\sigma^2} = \sum_{j=1}^{n}\frac{(x_j-\bar{x})^2}{\sigma^2} + \frac{n}{\sigma^2}(\bar{x}-\mu)^2$$

which means

$$\chi_n^2 = \sum_{j=1}^{n}\frac{(x_j-\bar{x})^2}{\sigma^2} + \chi_1^2. \tag{10.31}$$

But we have the property that if $\chi_m^2$ and $\chi_n^2$ are independently distributed then

$$\chi_m^2+\chi_n^2\sim\chi_{m+n}^2. \tag{10.32}$$

Independence of $\bar{x}$ and $s^2$ will guarantee from (10.31), by looking at the mgf of both
sides in (10.31), that

$$\sum_{j=1}^{n}\frac{(x_j-\bar{x})^2}{\sigma^2}\sim\chi_{n-1}^2. \tag{10.33}$$

But it can be shown that if the sample comes from a normal population, then $\bar{x}$ and
$s^2 = \sum_{j=1}^{n}\frac{(x_j-\bar{x})^2}{n}$ are independently distributed. Hence for the normal population, result
(10.33) holds. Independence will be proved later on. Independence of $\bar{x}$ and $s^2$, along
with some minor conditions, will in fact characterize a normal population; see the
book [11].
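
The mgf step behind (10.33) can be written out as follows. This is a short sketch that assumes the independence of $\bar{x}$ and $s^2$ stated above; the symbol $Q$ is introduced here only as shorthand and does not appear in the text.

```latex
% Q is shorthand (an assumption of this sketch) for the first term on the right of (10.31)
Q = \sum_{j=1}^{n}\frac{(x_j-\bar{x})^2}{\sigma^2}, \qquad
\chi_n^2 = Q + \frac{n}{\sigma^2}(\bar{x}-\mu)^2 .
% By independence, the mgf of the sum is the product of the mgfs:
(1-2t)^{-\frac{n}{2}} = M_Q(t)\,(1-2t)^{-\frac{1}{2}}
\;\Rightarrow\;
M_Q(t) = (1-2t)^{-\frac{n-1}{2}}, \qquad 1-2t>0 .
```

The last expression is the mgf of a chi-square with n - 1 degrees of freedom, which is the statement in (10.33).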


Note 10.8. If x and y are real scalar random variables and if x and y are independently
distributed and if $x_1 = a_1x+b_1$, $y_1 = a_2y+b_2$, where $a_1\ne 0$, $a_2\ne 0$, $b_1$, $b_2$ are
constants, then $x_1$ and $y_1$ are also independently distributed. Are x and $y^2$ independently
distributed? Are $x^2$ and y independently distributed? Are $x^2$ and $y^2$ independently
distributed?


Note 10.9. If x and y are real scalar random variables and if $x^2$ and $y^2$ are independently
distributed, then are the following independently distributed: (1) $x^2$ and y; (2) x
and $y^2$; (3) x and y?


Example 10.8 (Non-central chi-square). Let $x_1,\ldots,x_n$ be iid variables from a $N(\mu,\sigma^2)$.
Evaluate the density of $u = \sum_{j=1}^{n}\frac{x_j^2}{\sigma^2}$, $\mu\ne 0$. [This is known as a non-central chi-square
with n degrees of freedom and non-centrality parameter $\lambda = \frac{n\mu^2}{2\sigma^2}$ and it is written as
$u\sim\chi_n^2(\lambda)$, because when $\mu$ is present in the sum, then $\sum_{j=1}^{n}\frac{(x_j-\mu)^2}{\sigma^2}\sim\chi_n^2$, or a central
chi-square with n degrees of freedom.]


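Solution 10.8. Since the joint density of $x_1,\dots,x_n$ is available, let us compute the mgf of $u$, that is, $M_u(t)=E[e^{tu}]$.

$$M_u(t)=\int\cdots\int\frac{1}{(\sigma\sqrt{2\pi})^n}\exp\Big\{\frac{t}{\sigma^2}\sum_{j=1}^{n}x_j^2-\frac{1}{2\sigma^2}\sum_{j=1}^{n}(x_j-\mu)^2\Big\}\,\mathrm{d}x_1\wedge\cdots\wedge\mathrm{d}x_n. \tag{10.34}$$

Let us simplify the exponent:

$$\frac{t}{\sigma^2}\sum_{j=1}^{n}x_j^2-\frac{1}{2\sigma^2}\sum_{j=1}^{n}(x_j-\mu)^2
=-\frac{1}{2\sigma^2}\Big[(1-2t)\sum_{j=1}^{n}x_j^2-2\mu\sum_{j=1}^{n}x_j+n\mu^2\Big]$$

$$=-\frac{1}{2\sigma^2}\sum_{j=1}^{n}\Big(\sqrt{1-2t}\,x_j-\frac{\mu}{\sqrt{1-2t}}\Big)^2-\lambda+\frac{\lambda}{1-2t}$$

for $1-2t>0$ and $\lambda=\frac{n\mu^2}{2\sigma^2}$. For each $x_j$, the integral is the following:

$$\frac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^{\infty}\exp\Big\{-\frac{1}{2\sigma^2}\Big(\sqrt{1-2t}\,x_j-\frac{\mu}{\sqrt{1-2t}}\Big)^2\Big\}\,\mathrm{d}x_j=\frac{1}{\sqrt{1-2t}}.$$

Put $y_j=\sqrt{1-2t}\,x_j-\frac{\mu}{\sqrt{1-2t}}$ and integrate the variables one at a time; the value $\frac{1}{\sqrt{1-2t}}$ follows from the total integral of a normal density. Therefore,

$$M_u(t)=\frac{e^{-\lambda}\,e^{\frac{\lambda}{1-2t}}}{(1-2t)^{n/2}}=\sum_{k=0}^{\infty}\frac{\lambda^k e^{-\lambda}}{k!}\,\frac{1}{(1-2t)^{\frac{n}{2}+k}}.$$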


But we know that $(1-2t)^{-(\frac{n}{2}+k)}$ is the mgf of a chi-square with $n+2k$ degrees of freedom and its density is a gamma with parameters $(\alpha=\frac{n}{2}+k,\ \beta=2)$ and hence the density of $u$, denoted by $g(u)$, is given by

$$g(u)=\sum_{k=0}^{\infty}\frac{\lambda^k}{k!}e^{-\lambda}\,\frac{u^{\frac{n}{2}+k-1}e^{-\frac{u}{2}}}{2^{\frac{n}{2}+k}\,\Gamma(\frac{n}{2}+k)},\quad u\ge 0 \tag{10.35}$$

and zero elsewhere. This is the non-central chi-square density, which is in the form of weighted gamma (chi-square) densities, the weights being Poisson probabilities, that is, Poisson-weighted chi-square densities with $n+2k$ degrees of freedom. Observe also that since $\lambda>0$ we have $P_k=\frac{\lambda^k}{k!}e^{-\lambda}$ with $\sum_{k=0}^{\infty}P_k=1$, or the coefficients are from a Poisson distribution. The non-central chi-square density is of the form:

$$g(u)=\sum_{k=0}^{\infty}f_k(u)P_k$$

where

$$f_k(u)=\frac{u^{\frac{n}{2}+k-1}e^{-\frac{u}{2}}}{2^{\frac{n}{2}+k}\,\Gamma(\frac{n}{2}+k)},\quad u\ge 0 \tag{10.36}$$

and

$$P_k=\frac{\lambda^k}{k!}e^{-\lambda}.$$

This density is a very important density, which is also connected to the Bessel function in the theory of special functions.
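The Poisson-mixture form (10.35) can be evaluated directly by truncating the series. The sketch below is an illustration only: the values $n=3$, $\lambda=1.5$, the truncation at 60 terms, the integration grid and the function name `noncentral_chisq_pdf` are assumptions made for this example. It checks that the density integrates to about 1 and that the mean is about $n+2\lambda$, which follows from averaging the component means $n+2k$ over the Poisson weights.

```python
import math

def noncentral_chisq_pdf(u, n, lam, terms=60):
    """Truncated series (10.35): Poisson(lam) weights times chi-square(n + 2k) densities."""
    if u <= 0.0:
        return 0.0
    total = 0.0
    for k in range(terms):
        shape = n / 2 + k  # gamma shape parameter of the k-th component
        log_weight = k * math.log(lam) - lam - math.lgamma(k + 1)
        log_density = ((shape - 1) * math.log(u) - u / 2
                       - shape * math.log(2) - math.lgamma(shape))
        total += math.exp(log_weight + log_density)
    return total

n, lam = 3, 1.5
upper, num = 80.0, 4000          # integrate over (0, 80) with the midpoint rule
step = upper / num
grid = [(i + 0.5) * step for i in range(num)]
values = [noncentral_chisq_pdf(u, n, lam) for u in grid]
total_mass = sum(values) * step
mean = sum(u * g for u, g in zip(grid, values)) * step
print(total_mass, mean, n + 2 * lam)
```

Working with logarithms of the weights and of the gamma densities avoids overflow in $\Gamma(\frac{n}{2}+k)$ for large $k$.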


Exercises 10.3 

10.3.1. If $x_1,x_2,x_3$ are independently distributed so that $x_1\sim N(0,\sigma^2=1)$, $x_2\sim N(\mu=2,\sigma^2=4)$, $x_3\sim N(\mu=-1,\sigma^2=5)$ evaluate the densities of the following: (1) $u_1=x_1+x_2+x_3$;
(2) u 22x, - 3x4 + 5x3; (3) μα = X? + ο δν y ep. 3 (4) u, =X? + 05 - 2; (5) us =x? + 


x x 
ats: 




10.3.2. By using the mgf or otherwise compute the exact densities of (1) $u_1=\chi^2_n-n$; (2) $u_2=\frac{\chi^2_n-n}{\sqrt{2n}}$; (3) show that $u_2\to z\sim N(0,1)$ as $n\to\infty$.


10.3.3. Interpret (3) in Exercise 10.3.2 in terms of the central limit theorem. What is 
the population and what is the sample? 


10.3.4. By using a computer (1) compute $\gamma$ so that $\Pr\{|u_2|\ge\gamma\}=0.05$ for $n=10,20,30$ where $u_2$ is the same $u_2$ as in Exercise 10.3.2; (2) determine $n$ so that $\gamma$ approximates well the corresponding $N(0,1)$ value 1.96 at the 5% level (tail area is 0.05).


10.3.5. Find $n_0$ such that for $n\ge n_0$ the standard normal approximation in Exercise 10.3.4 holds well.


10.4 Student-t and F distributions 


The distribution of a random variable of the type $u=\frac{z}{\sqrt{y/\nu}}$, where $z$ is a standard normal, $z\sim N(0,1)$, $y$ is a chi-square with $\nu$ degrees of freedom, $y\sim\chi^2_\nu$, and where $z$ and $y$ are independently distributed, is known as a Student-t variable with $\nu$ degrees of freedom, $t_\nu$.


Definition 10.6 (A Student-t statistic $t_\nu$). A Student-t variable with $\nu$ degrees of freedom is defined as

$$t_\nu=u=\frac{z}{\sqrt{y/\nu}},\quad z\sim N(0,1),\quad y\sim\chi^2_\nu \tag{10.37}$$

where $z$ and $y$ are independently distributed.


The person who derived the density of the variable $t_\nu$, W. S. Gosset, wrote the paper under the pen name "Student", and hence the distribution is known in the literature as the Student-t distribution. Before deriving the density, let us examine more general structures and derive the density as a special case of such a general structure.


Example 10.9. Let $x_1$ and $x_2$ be independently distributed real gamma random variables with the parameters $(\alpha_1,\beta)$ and $(\alpha_2,\beta)$, respectively, that is, the scale parameters are equal to some $\beta>0$. Let $u_1=x_1+x_2$, $u_2=\frac{x_1}{x_1+x_2}$, $u_3=\frac{x_1}{x_2}$. Derive the distributions of $u_1,u_2,u_3$.


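Solution 10.9. Due to independence, the joint density of $x_1$ and $x_2$, denoted by $f(x_1,x_2)$, is given by

$$f(x_1,x_2)=\frac{x_1^{\alpha_1-1}x_2^{\alpha_2-1}e^{-\frac{1}{\beta}(x_1+x_2)}}{\beta^{\alpha_1+\alpha_2}\,\Gamma(\alpha_1)\Gamma(\alpha_2)},\quad 0\le x_i<\infty,\ i=1,2 \tag{10.38}$$

and $f(x_1,x_2)=0$ elsewhere. Let us make the polar coordinate transformation: $x_1=r\cos^2\theta$, $x_2=r\sin^2\theta$. We have taken $\cos^2\theta$ and $\sin^2\theta$ due to the presence of $x_1+x_2$ in the exponent. The Jacobian of the transformation is the determinant

$$\begin{vmatrix}\frac{\partial x_1}{\partial r} & \frac{\partial x_1}{\partial\theta}\\[2pt] \frac{\partial x_2}{\partial r} & \frac{\partial x_2}{\partial\theta}\end{vmatrix}
=\begin{vmatrix}\cos^2\theta & -2r\cos\theta\sin\theta\\ \sin^2\theta & 2r\cos\theta\sin\theta\end{vmatrix}
=2r\cos\theta\sin\theta.$$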


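The joint density of $r$ and $\theta$, denoted by $g(r,\theta)$, is given by

$$g(r,\theta)=\frac{(r\cos^2\theta)^{\alpha_1-1}(r\sin^2\theta)^{\alpha_2-1}}{\beta^{\alpha_1+\alpha_2}\,\Gamma(\alpha_1)\Gamma(\alpha_2)}\,e^{-\frac{r}{\beta}}\,2r\cos\theta\sin\theta$$

$$=\frac{r^{\alpha_1+\alpha_2-1}e^{-\frac{r}{\beta}}}{\beta^{\alpha_1+\alpha_2}\,\Gamma(\alpha_1+\alpha_2)}\times\frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)}(\cos^2\theta)^{\alpha_1-1}(\sin^2\theta)^{\alpha_2-1}\,2\cos\theta\sin\theta$$

$$=g_1(r)\,g_2(\theta) \tag{10.39}$$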


by multiplying and dividing by $\Gamma(\alpha_1+\alpha_2)$. From (10.39), a few properties are obvious. (a) $x_1+x_2=r$ has the density $g_1(r)$, which is a gamma density, and hence $u_1=x_1+x_2$ is gamma distributed with parameters $(\alpha_1+\alpha_2,\beta)$; (b) Since (10.39) is in the form of a product of two densities, one a function of $r$ alone and the other a function of $\theta$ alone, the variables $r$ and $\theta$ are independently distributed. (c)

$$u_2=\frac{x_1}{x_1+x_2}=\frac{r\cos^2\theta}{r\cos^2\theta+r\sin^2\theta}=\cos^2\theta$$

is a function of $\theta$ alone. Hence $u_1=x_1+x_2$ and $u_2=\frac{x_1}{x_1+x_2}$ are independently distributed. (d) But $x_1=u_1u_2\Rightarrow E(x_1^h)=E(u_1^h)E(u_2^h)$ due to the independence of $u_1$ and $u_2$. Therefore, we have the following result.


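Result 10.10. When the real scalar random variables $x_1$ and $x_2$ are independently distributed as gamma variables with parameters $(\alpha_1,\beta)$ and $(\alpha_2,\beta)$, respectively, with the same $\beta$, and when $u_1=x_1+x_2$, $u_2=\frac{x_1}{x_1+x_2}$, then

$$E(u_2^h)=\frac{E(x_1^h)}{E(u_1^h)}\ \Rightarrow\ E\Big[\Big(\frac{x_1}{x_1+x_2}\Big)^h\Big]=\frac{E(x_1^h)}{E[(x_1+x_2)^h]}.$$

Note that even if $y_1$ and $y_2$ are independently distributed, in general $E\big[\big(\frac{y_1}{y_2}\big)^h\big]\neq\frac{E(y_1^h)}{E(y_2^h)}$. From (10.39), the density of $u_2=\cos^2\theta$ is obtained as follows: The non-zero part of the density of $x_1$ and $x_2$ is in the first quadrant, and hence $0\le r<\infty$, $0\le\theta\le\frac{\pi}{2}$. Note that $\mathrm{d}u_2=-2\cos\theta\sin\theta\,\mathrm{d}\theta$, so that, ignoring the sign, $|\mathrm{d}u_2|=2\cos\theta\sin\theta\,\mathrm{d}\theta$. But the density of $\theta$ is

$$g_2(\theta)=\frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)}(\cos^2\theta)^{\alpha_1-1}(\sin^2\theta)^{\alpha_2-1}\,2\cos\theta\sin\theta,\quad 0\le\theta\le\frac{\pi}{2} \tag{10.40}$$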




and $g_2(\theta)=0$ elsewhere. Also, $0\le\theta\le\frac{\pi}{2}$ means $0\le u_2=\cos^2\theta\le 1$. Hence the density of $u_2$ is given by

$$g_{u_2}(u_2)=\frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)}\,u_2^{\alpha_1-1}(1-u_2)^{\alpha_2-1},\quad 0\le u_2\le 1,\ \alpha_1>0,\ \alpha_2>0 \tag{10.41}$$

and $g_{u_2}(u_2)=0$ elsewhere. Hence $u_2$ is a type-1 beta variable with parameters $(\alpha_1,\alpha_2)$, and therefore $1-u_2$ is a type-1 beta with the parameters $(\alpha_2,\alpha_1)$.


Result 10.11. If $x_1$ and $x_2$ are as in Result 10.10, then $u_2=\frac{x_1}{x_1+x_2}$ is a type-1 beta random variable with the parameters $(\alpha_1,\alpha_2)$.


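(e) $u_3=\frac{x_1}{x_2}=\frac{r\cos^2\theta}{r\sin^2\theta}=\cot^2\theta$. We can evaluate the density of $u_3=\cot^2\theta$ either from the density of $\theta$ in (10.40) or from the density of $u_2$ in (10.41).

$$u_2=\frac{x_1}{x_1+x_2}=\frac{x_1/x_2}{1+x_1/x_2}=\frac{u_3}{1+u_3}\ \Rightarrow\ u_3=\frac{u_2}{1-u_2},$$

$$\mathrm{d}u_3=\frac{1}{(1-u_2)^2}\,\mathrm{d}u_2=(1+u_3)^2\,\mathrm{d}u_2\ \Rightarrow\ \mathrm{d}u_2=\frac{1}{(1+u_3)^2}\,\mathrm{d}u_3.$$

From (10.41), the density of $u_3$, denoted by $g_{u_3}(u_3)$, is given by

$$g_{u_3}(u_3)=\frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)}\Big(\frac{u_3}{1+u_3}\Big)^{\alpha_1-1}\Big(\frac{1}{1+u_3}\Big)^{\alpha_2-1}\frac{1}{(1+u_3)^2}$$

$$=\begin{cases}\dfrac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)}\,u_3^{\alpha_1-1}(1+u_3)^{-(\alpha_1+\alpha_2)}, & 0\le u_3<\infty,\ \alpha_1>0,\ \alpha_2>0\\[4pt] 0, & \text{elsewhere.}\end{cases} \tag{10.42}$$

Therefore, $u_3=\frac{x_1}{x_2}$ is type-2 beta distributed with parameters $(\alpha_1,\alpha_2)$ and then $\frac{1}{u_3}=\frac{x_2}{x_1}$ is type-2 beta distributed with the parameters $(\alpha_2,\alpha_1)$.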


Result 10.12. If $x_1$ and $x_2$ are as in Result 10.10, then $u_3=\frac{x_1}{x_2}$ is a type-2 beta with the parameters $(\alpha_1,\alpha_2)$.
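Results 10.10 and 10.11 can be illustrated by simulation. This is a rough sketch under stated assumptions: the parameter values $\alpha_1=2$, $\alpha_2=3$, $\beta=1.5$ and the helper `correlation` are chosen only for illustration, and Python's `random.gammavariate(alpha, beta)` is used with `beta` as the scale parameter, which matches the gamma density in (10.38). A near-zero sample correlation between $u_1$ and $u_2$ is consistent with, but does not prove, their independence.

```python
import math
import random

def correlation(a, b):
    """Sample correlation coefficient of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

rng = random.Random(1)
alpha1, alpha2, beta = 2.0, 3.0, 1.5
u1_values, u2_values = [], []
for _ in range(50000):
    x1 = rng.gammavariate(alpha1, beta)   # shape alpha1, scale beta
    x2 = rng.gammavariate(alpha2, beta)   # shape alpha2, same scale
    u1_values.append(x1 + x2)
    u2_values.append(x1 / (x1 + x2))

mean_u1 = sum(u1_values) / len(u1_values)   # gamma mean: (alpha1 + alpha2) * beta
mean_u2 = sum(u2_values) / len(u2_values)   # type-1 beta mean: alpha1 / (alpha1 + alpha2)
corr_u1_u2 = correlation(u1_values, u2_values)
print(mean_u1, mean_u2, corr_u1_u2)
```

With these parameters the simulated means should be close to 7.5 and 0.4, respectively.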


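Now, consider a particular case of a gamma variable, namely the chi-square variable. Let $x_1\sim\chi^2_m$ and $x_2\sim\chi^2_n$ be independently distributed. Here $x_1\sim\chi^2_m$ means a gamma with the parameters $(\alpha=\frac{m}{2},\ \beta=2)$. Then

$$u_3=\frac{x_1}{x_2}$$

has the density, for $m,n=1,2,\dots$,

$$f_{u_3}(u_3)=\frac{\Gamma(\frac{m+n}{2})}{\Gamma(\frac{m}{2})\Gamma(\frac{n}{2})}\,u_3^{\frac{m}{2}-1}(1+u_3)^{-(\frac{m+n}{2})},\quad 0\le u_3<\infty \tag{10.43}$$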




and $f_{u_3}(u_3)=0$ elsewhere. A constant multiple, namely,

$$\frac{n}{m}\,u_3=\frac{\chi^2_m/m}{\chi^2_n/n},$$

or a chi-square with $m$ degrees of freedom divided by its degrees of freedom, the whole divided by a chi-square with $n$ degrees of freedom divided by its degrees of freedom, where the two chi-squares are independently distributed, is known as an F random variable with $m$ and $n$ degrees of freedom and is usually written as $F_{m,n}$.


Definition 10.7 (F random variable). An $F=F_{m,n}$ random variable is defined, and it is connected to $u_3$, as follows:

$$F_{m,n}=\frac{\chi^2_m/m}{\chi^2_n/n}=\frac{n}{m}\,u_3.$$

Then the F-density is available from the type-2 beta density or from (10.43), and it is the following:

$$f_F(F_{m,n})=\frac{\Gamma(\frac{m+n}{2})}{\Gamma(\frac{m}{2})\Gamma(\frac{n}{2})}\Big(\frac{m}{n}\Big)^{\frac{m}{2}}F_{m,n}^{\frac{m}{2}-1}\Big(1+\frac{m}{n}F_{m,n}\Big)^{-(\frac{m+n}{2})} \tag{10.44}$$

for $0\le F_{m,n}<\infty$, $m,n=1,2,\dots$ and $f_F(F_{m,n})=0$ elsewhere.


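Special Case for $m=1$, $n=\nu$. Let $F_{1,\nu}=t_\nu^2$. Then putting $m=1$, $n=\nu$ in (10.44) we have the density of $F_{1,\nu}=\frac{\chi^2_1}{\chi^2_\nu/\nu}=t_\nu^2$, which will then be the density of the square of a Student-t with $\nu$ degrees of freedom. Denoting the density by $f_t(t_\nu^2)$, we have the following:

$$f_t(t_\nu^2)=\frac{\Gamma(\frac{\nu+1}{2})}{\Gamma(\frac{1}{2})\Gamma(\frac{\nu}{2})\sqrt{\nu}}\,(t_\nu^2)^{\frac{1}{2}-1}\Big(1+\frac{t_\nu^2}{\nu}\Big)^{-(\frac{\nu+1}{2})},\quad 0\le t_\nu^2<\infty.$$

Note that

$$\frac{\chi^2_1}{\chi^2_\nu/\nu}=\frac{z^2}{\chi^2_\nu/\nu}\ \Rightarrow\ |t_\nu|=\frac{|z|}{\sqrt{\chi^2_\nu/\nu}}$$

where $z$ is a standard normal variable. For $t_\nu>0$, it is a one to one transformation and

$$F_{1,\nu}=t_\nu^2\ \Rightarrow\ t_\nu=F_{1,\nu}^{\frac{1}{2}},\quad \mathrm{d}t_\nu=\frac{1}{2}F_{1,\nu}^{\frac{1}{2}-1}\,\mathrm{d}F_{1,\nu}\quad\text{or}\quad 2\,\mathrm{d}t_\nu=F_{1,\nu}^{\frac{1}{2}-1}\,\mathrm{d}F_{1,\nu}.$$

Hence for $t_\nu>0$ the folded Student-t density is given by

$$f_t(t_\nu)=\frac{2\,\Gamma(\frac{\nu+1}{2})}{\sqrt{\pi\nu}\,\Gamma(\frac{\nu}{2})}\Big(1+\frac{t_\nu^2}{\nu}\Big)^{-(\frac{\nu+1}{2})},\quad 0\le t_\nu<\infty. \tag{10.45}$$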


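[Figure 10.3: Student-t density.]


Since it is symmetric about $t_\nu=0$, the Student-t density is given by

$$f_t(t_\nu)=\frac{\Gamma(\frac{\nu+1}{2})}{\sqrt{\pi\nu}\,\Gamma(\frac{\nu}{2})}\Big(1+\frac{t_\nu^2}{\nu}\Big)^{-(\frac{\nu+1}{2})},\quad -\infty<t_\nu<\infty.$$

A graph of the Student-t density is given in Figure 10.3. Another way of deriving the density directly from the joint density of an independently distributed standard normal variable and a chi-square variable is by using transformation of variables. Since $t_\nu=\frac{z}{\sqrt{y/\nu}}$ where $z\sim N(0,1)$ and $y\sim\chi^2_\nu$, with $z$ and $y$ independently distributed, for $z>0$ the transformation is one to one, and similarly for $z<0$ also the transformation is one to one. Let $z>0$. Take $u=\frac{z}{\sqrt{y/\nu}}$ and $v=y$. Then $\mathrm{d}z\wedge\mathrm{d}y=\sqrt{\frac{v}{\nu}}\,\mathrm{d}u\wedge\mathrm{d}v$. The joint density of $z$ and $y$, denoted by $f(z,y)$, is given by

$$f(z,y)=\frac{e^{-\frac{z^2}{2}}}{\sqrt{2\pi}}\times\frac{y^{\frac{\nu}{2}-1}e^{-\frac{y}{2}}}{2^{\frac{\nu}{2}}\Gamma(\frac{\nu}{2})},\quad z=u\sqrt{\frac{v}{\nu}},\ y=v$$

and then the joint density of $u$ and $v$, denoted by $g(u,v)$, is given by

$$g(u,v)=\frac{1}{\sqrt{2\pi}}\,e^{-\frac{u^2v}{2\nu}}\,\frac{v^{\frac{\nu}{2}-1}e^{-\frac{v}{2}}}{2^{\frac{\nu}{2}}\Gamma(\frac{\nu}{2})}\sqrt{\frac{v}{\nu}}.$$

Integrating out $v$, we have that part of the marginal density for $u$ given by

$$g_1(u)=\frac{1}{\sqrt{2\pi\nu}\,2^{\frac{\nu}{2}}\Gamma(\frac{\nu}{2})}\int_0^{\infty}v^{\frac{\nu+1}{2}-1}e^{-\frac{v}{2}(1+\frac{u^2}{\nu})}\,\mathrm{d}v
=\frac{\Gamma(\frac{\nu+1}{2})}{\sqrt{\pi\nu}\,\Gamma(\frac{\nu}{2})}\Big(1+\frac{u^2}{\nu}\Big)^{-(\frac{\nu+1}{2})},\quad u>0.$$

The same is the function for $u<0$, and hence for $-\infty<t_\nu<\infty$, $u^2=t_\nu^2$, we have the density of $t_\nu$ given by

$$f_t(t_\nu)=\frac{\Gamma(\frac{\nu+1}{2})}{\sqrt{\pi\nu}\,\Gamma(\frac{\nu}{2})}\Big(1+\frac{t_\nu^2}{\nu}\Big)^{-(\frac{\nu+1}{2})},\quad -\infty<t_\nu<\infty.$$

Observe that when the sample $x_1,\dots,x_n$ comes from a normal population $N(\mu,\sigma^2)$ we have

$$\frac{\sqrt{n}(\bar{x}-\mu)}{\sigma}\sim N(0,1)$$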




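and

$$\sum_{j=1}^{n}\frac{(x_j-\bar{x})^2}{\sigma^2}\sim\chi^2_{n-1}$$

and these two are independently distributed. Hence

$$t_{n-1}=\frac{\sqrt{n}(\bar{x}-\mu)}{\sigma}\Big/\Big[\sum_{j=1}^{n}\frac{(x_j-\bar{x})^2}{(n-1)\sigma^2}\Big]^{\frac{1}{2}}=\frac{\sqrt{n}(\bar{x}-\mu)}{\hat{\sigma}}$$

is a Student-t with $n-1$ degrees of freedom, where $\hat{\sigma}^2=\frac{1}{n-1}\sum_{j=1}^{n}(x_j-\bar{x})^2$, which is the unbiased estimator for $\sigma^2$. Thus, if $\sigma^2$ is replaced by its unbiased estimator $\hat{\sigma}^2$ then the standardized normal variable changes to a Student-t variable with $n-1$ degrees of freedom:

$$\frac{\sqrt{n}(\bar{x}-\mu)}{\sigma}\sim N(0,1)\quad\text{and}\quad\frac{\sqrt{n}(\bar{x}-\mu)}{\hat{\sigma}}\sim t_{n-1}. \tag{10.46}$$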
We also have a corresponding distribution on variance ratios. Consider two independent populations $N(\mu_1,\sigma_1^2)$ and $N(\mu_2,\sigma_2^2)$ and let $x_1,\dots,x_m$ and $y_1,\dots,y_n$ be iid variables from these two populations respectively. Then

$$\frac{\hat{\sigma}_1^2/\sigma_1^2}{\hat{\sigma}_2^2/\sigma_2^2}=\frac{\sum_{j=1}^{m}(x_j-\bar{x})^2/((m-1)\sigma_1^2)}{\sum_{j=1}^{n}(y_j-\bar{y})^2/((n-1)\sigma_2^2)}\sim F_{m-1,n-1}$$

$$\Rightarrow\ \frac{\sum_{j=1}^{m}(x_j-\bar{x})^2/(m-1)}{\sum_{j=1}^{n}(y_j-\bar{y})^2/(n-1)}\sim F_{m-1,n-1}\quad\text{for }\sigma_1^2=\sigma_2^2. \tag{10.47}$$

This F-density is also known as the density for the "variance ratio", which is useful in testing hypotheses of the type $\sigma_1^2=\sigma_2^2$. Observe that the results in (10.46) and (10.47) do not hold when the populations are not independent normal populations.
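The Student-t density derived in this section can be integrated numerically to see how its tail points compare with those of the standard normal. The following is a minimal sketch, not a library-quality routine: the function names, the use of Simpson's rule with 2000 subintervals and the bisection bracket $[0,50]$ are all assumptions of this example.

```python
import math

def student_t_pdf(t, nu):
    """Student-t density with nu degrees of freedom."""
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(math.pi * nu))
    return math.exp(log_c - (nu + 1) / 2 * math.log1p(t * t / nu))

def central_area(c, nu, steps=2000):
    """Pr{|t_nu| <= c}: Simpson's rule on [0, c], doubled by symmetry."""
    h = c / steps
    s = student_t_pdf(0.0, nu) + student_t_pdf(c, nu)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * student_t_pdf(i * h, nu)
    return 2 * s * h / 3

def two_sided_critical_value(nu, tail=0.05):
    """Find c with Pr{|t_nu| >= c} = tail by bisection."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if 1 - central_area(mid, nu) > tail:
            lo = mid      # tail area still too large: move right
        else:
            hi = mid
    return (lo + hi) / 2

critical = {nu: two_sided_critical_value(nu) for nu in (10, 30, 100)}
for nu, c in critical.items():
    print(nu, round(c, 3))   # compare with the N(0,1) value 1.96
```

The computed points decrease toward 1.96 as $\nu$ grows, but remain visibly above it for moderate $\nu$.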


Exercises 10.4 


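10.4.1. Let $x_1,\dots,x_m$ be iid variables from the population $N(\mu_1,\sigma_1^2)$ and let $y_1,\dots,y_n$ be iid variables from the population $N(\mu_2,\sigma_2^2)$ and let all variables be mutually independently distributed. [This is also known as samples coming from independent populations.] Let $\bar{x}=\sum_{j=1}^{m}\frac{x_j}{m}$ and $\bar{y}=\sum_{j=1}^{n}\frac{y_j}{n}$. Then show that the following results hold:

(1) $\Big(\frac{x_j-\mu_1}{\sigma_1}\Big)^2\sim\chi^2_1$;  (2) $s_1^2=\sum_{j=1}^{m}\Big(\frac{x_j-\mu_1}{\sigma_1}\Big)^2\sim\chi^2_m$;

(3) $\Big(\frac{y_j-\mu_2}{\sigma_2}\Big)^2\sim\chi^2_1$;  (4) $s_2^2=\sum_{j=1}^{n}\Big(\frac{y_j-\mu_2}{\sigma_2}\Big)^2\sim\chi^2_n$;

(5) $\frac{s_1^2}{s_2^2}\sim$ type-2 beta $\big(\frac{m}{2},\frac{n}{2}\big)$;

(6) $\frac{s_1^2}{s_1^2+s_2^2}\sim$ type-1 beta $\big(\frac{m}{2},\frac{n}{2}\big)$;

(7) $\frac{n}{m}\frac{s_1^2}{s_2^2}\sim F_{m,n}$;  (8) $s_3^2=\sum_{j=1}^{m}\Big(\frac{x_j-\bar{x}}{\sigma_1}\Big)^2\sim\chi^2_{m-1}$;

(9) $s_4^2=\sum_{j=1}^{n}\Big(\frac{y_j-\bar{y}}{\sigma_2}\Big)^2\sim\chi^2_{n-1}$;

(10) $\frac{s_3^2}{s_4^2}\sim$ type-2 beta $\big(\frac{m-1}{2},\frac{n-1}{2}\big)$;

(11) $\frac{s_3^2}{s_3^2+s_4^2}\sim$ type-1 beta $\big(\frac{m-1}{2},\frac{n-1}{2}\big)$;

(12) $\frac{n-1}{m-1}\frac{s_3^2}{s_4^2}\sim F_{m-1,n-1}$.

Note that when $\sigma_1^2=\sigma_2^2=\sigma^2$ then all the variances, $\sigma_1^2,\sigma_2^2$, will disappear from all the ratios above. Hence the above results are important in testing hypotheses of the type $\sigma_1^2=\sigma_2^2=\sigma^2$.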
10.4.2. For the same samples in Exercise 10.4.1, evaluate the densities of the following 
variables: 


1 1 
(1) u- 363% -x35 (2) w= 35; ^ -y»5 

1 1 
(3) U3 = 5% -x35 (4) u, = 501 =Y>)"5 

02 (x, — x3 (x, - x3 
(5 us = Si “2; (6) α,------ᾱ-. 

: Of 1- y»? i -y»* 
(7) u= τ (8) Vue. 
10.4.3. Show that $F_{m,n}=\frac{1}{F_{n,m}}$. Let $x=F_{m,n}$ with density $f_1(x)$ and let $y=\frac{1}{x}=F_{n,m}$ with density $f_2(y)$. The notation $F_{m,n,\alpha}$ means the point from where onward to the right the area under the curve $f_1(x)$ is $\alpha$. Then $F_{n,m,1-\alpha}$ means the point from where onward to the right the area under the curve $f_2(y)$ is $1-\alpha$. By using the densities $f_1(x)$ and $f_2(y)$ and then by transforming $y=\frac{1}{x}$ show that

$$F_{m,n,\alpha}=\frac{1}{F_{n,m,1-\alpha}}.$$

[Note that, due to this property, only the right tail areas are tabulated in the case of the F distribution. Such numerical tables, called F-tables, are available.]


10.4.4. Notations $\chi^2_{\nu,\alpha}$, $t_{\nu,\alpha}$, $F_{m,n,\alpha}$ mean the point from where onward to the right the area under the curve, in the case of the chi-square density, Student-t density and F-density, is $\alpha$. By using a computer, compute $\chi^2_{\nu,\alpha}$, $t_{\nu,\alpha}$, $F_{m,n,\alpha}$ for $\alpha=0.05$ (5% tables), $\alpha=0.01$ (1% tables) for various values of $\nu,m,n=1,2,\dots$. [This is equivalent to creating 5% and 1% chi-square, Student-t and F-tables.]




10.4.5. Derive the density of a non-central F, where the numerator chi-square is non-central with $m$ degrees of freedom and non-centrality parameter $\lambda$, and the denominator chi-square is central with $n$ degrees of freedom.


10.4.6. Derive the density of a doubly non-central $F_{m,n}(\lambda_1,\lambda_2)$ with degrees of freedom $m$ and $n$ and non-centrality parameters $\lambda_1$ and $\lambda_2$.


10.4.7. For the standard normal distribution $x\sim N(0,1)$, $\Pr\{|x|\ge\gamma\}=0.05$ means $\gamma=1.96$. By using a computer, calculate $\gamma$ such that $\Pr\{|t_\nu|\ge\gamma\}=0.05$ for $\nu=10,20,30,100$. Then show that a Student-t does not approximate well to a standard normal variable even for $\nu=100$. Hence conclude that reading from standard normal tables, when the degrees of freedom of a Student-t is greater than or equal to 30, is not a valid procedure.


10.4.8. For a type-2 beta variable $x$ with parameters $\alpha$ and $\beta$ show that

$$E(x^h)=\frac{\Gamma(\alpha+h)}{\Gamma(\alpha)}\,\frac{\Gamma(\beta-h)}{\Gamma(\beta)},\quad -\alpha<h<\beta$$

when real and $-\Re(\alpha)<\Re(h)<\Re(\beta)$ when in the complex domain. What are the corresponding conditions for (1) the $F_{m,n}$ random variable; (2) the Student-t variable with $\nu$ degrees of freedom. [Hint: $x=\frac{m}{n}F_{m,n}$.]


10.4.9. Show that (1) E(t,) does not exist for v = 1, 2; (2) E(t?) does not exist for v = 3,4. 


10.4.10. Evaluate the $h$-th moment of a non-central chi-square with $\nu$ degrees of freedom and non-centrality parameter $\lambda$, and write down its conditions for existence. Write it as a hypergeometric function, if possible.


10.4.11. Evaluate the $h$-th moment of a (1) singly non-central $F_{m,n}(\lambda)$ with numerator chi-square $\chi^2_m(\lambda)$; (2) doubly non-central $F_{m,n}(\lambda_1,\lambda_2)$, and write down the conditions for its existence.


10.5 Linear forms and quadratic forms 


We have already looked into linear functions of normally distributed random variables 
in equation (10.24). We have seen that arbitrary linear functions of independently dis- 
tributed normal variables are again normally distributed. We will show later that this 
property holds even if the variables are not independently distributed but having a cer- 
tain form of a joint normal distribution. First, we will look into some convenient ways 
of writing linear forms and quadratic forms by using vector and matrix notations. 

Consider a set of real scalar random variables $x_1,\dots,x_k$ and real scalar constants $a_1,\dots,a_k$. Then a linear form is of the type $y$ and a linear expression is of the type $y_1=y+b$ where $b$ is a constant, where

$$y=a_1x_1+\cdots+a_kx_k;\quad y_1=a_1x_1+\cdots+a_kx_k+b.$$

These can also be written as

$$y=a'X=X'a,\quad y_1=a'X+b=X'a+b \tag{10.48}$$

where a prime denotes the transpose and

$$a=\begin{bmatrix}a_1\\ \vdots\\ a_k\end{bmatrix},\quad X=\begin{bmatrix}x_1\\ \vdots\\ x_k\end{bmatrix},\quad a'=(a_1,a_2,\dots,a_k),\quad X'=(x_1,\dots,x_k).$$

For example,

$$u_1=2x_1-x_2+x_3=[2,-1,1]\begin{bmatrix}x_1\\ x_2\\ x_3\end{bmatrix}=[x_1,x_2,x_3]\begin{bmatrix}2\\ -1\\ 1\end{bmatrix};\quad u_2=2x_1-x_2+x_3+5=a'X+5.$$

Here, $b=5$.


A simple quadratic form is of the type $X'X=x_1^2+x_2^2+\cdots+x_k^2$ where $X'=(x_1,\dots,x_k)$ and the prime denotes the transpose. A general quadratic form is of the type

$$X'AX=\sum_{i=1}^{k}\sum_{j=1}^{k}a_{ij}x_ix_j \tag{10.49}$$

$$=\sum_{j=1}^{k}a_{jj}x_j^2+\sum_{i}\sum_{j,\,i\neq j}a_{ij}x_ix_j \tag{10.50}$$

$$=\sum_{j=1}^{k}a_{jj}x_j^2+2\sum_{i<j}a_{ij}x_ix_j=\sum_{j=1}^{k}a_{jj}x_j^2+2\sum_{i>j}a_{ij}x_ix_j \tag{10.51}$$

where the matrix $A=A'$ without loss of generality, $X'=(x_1,\dots,x_k)$. The coefficient of $x_ix_j$ is $a_{ij}$ for all $i$ and $j$, including $i=j$. In (10.49), all terms, including the case $i=j$, are written in a single expression. In (10.50), the diagonal terms and all non-diagonal terms are separated. Due to symmetry, $A=A'$, we have $a_{ij}=a_{ji}$ and hence the coefficient of $x_jx_i$ will be the same as that of $x_ix_j$. Thus some of the terms appear twice and this is reflected in (10.51). For example, for $k=2$ the above representations are equivalent to the following:

$$X'AX=a_{11}x_1^2+a_{12}x_1x_2+a_{21}x_2x_1+a_{22}x_2^2$$

$$=a_{11}x_1^2+a_{22}x_2^2+2a_{12}x_1x_2\quad\text{since }a_{12}=a_{21}$$

$$=a_{11}x_1^2+a_{22}x_2^2+2a_{21}x_2x_1.$$

Definition 10.8 (Linear Forms, Quadratic Form, Linear Expressions and Quadratic 
Expressions). A linear form is where all terms are of degree one each. A linear ex- 
pression is one where the maximum degree is one. A quadratic form is where all 
terms are of degree 2 each. A quadratic expression is such that the maximum de- 
gree of the terms is two. 




Thus, a quadratic expression has a general representation $X'AX+a'X+b$ where $a'X$ is a linear form and $b$ is a scalar constant, $a'=(a_1,\dots,a_k)$ a set of scalar constants. Examples will be of the following types:

$u_1=x_1^2+\cdots+x_k^2$ (a quadratic form);

$u_2=2x_1^2-5x_2^2+x_3^2-2x_1x_2-6x_2x_3$ (a quadratic form);

$u_3=x_1^2+5x_2^2-3x_1x_2+4x_1-x_2+7$ (a quadratic expression).


Example 10.10. Write the following in vector, matrix notation:

$$u_1=2x_1^2-5x_2^2+x_3^2-2x_1x_2+3x_1-x_2+4;$$

$$u_2=x_1-x_2+x_3;$$

$$u_3=x_1^2+2x_2^2-x_3^2+4x_2x_3.$$


Solution 10.10. Here, $u_1$ is a quadratic expression

$$u_1=X'AX+a'X+b\quad\text{where}$$

$$X'=(x_1,x_2,x_3),\quad a'=(3,-1,0),\quad b=4,\quad A=\begin{bmatrix}2&-1&0\\ -1&-5&0\\ 0&0&1\end{bmatrix}.$$

Next,

$$u_2=a'X,\quad a'=(1,-1,1),\quad X'=(x_1,x_2,x_3).$$

It is a linear form. Finally,

$$u_3=X'AX,\quad X'=(x_1,x_2,x_3),\quad A=\begin{bmatrix}1&0&0\\ 0&2&2\\ 0&2&-1\end{bmatrix}.$$


This is a quadratic form. In all these cases, the matrix of the quadratic form is written in the symmetric form. Any quadratic form in real variables will be of the form $X'AX$, where $X'=(x_1,\dots,x_k)$, $A=(a_{ij})$, and $X'AX$ is a scalar quantity or a $1\times 1$ matrix, and hence it is equal to its transpose. That is, $X'AX=(X'AX)'=X'A'X$ and, therefore,

$$X'AX=\frac{1}{2}[X'AX+X'A'X]=X'\Big(\frac{A+A'}{2}\Big)X=X'BX,\quad B=\frac{1}{2}(A+A')=B'$$

and hence the result.
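The symmetrization argument can be tried on $u_3$ of Example 10.10. The sketch below is illustrative only: the helper names `quad_form` and `symmetrize`, the non-symmetric starting matrix `A_upper` and the test point `x` are assumptions introduced for this example.

```python
def quad_form(A, x):
    """Return x'Ax for a square matrix A (list of rows) and a vector x."""
    k = len(x)
    return sum(A[i][j] * x[i] * x[j] for i in range(k) for j in range(k))

def symmetrize(A):
    """Return B = (A + A')/2."""
    k = len(A)
    return [[(A[i][j] + A[j][i]) / 2 for j in range(k)] for i in range(k)]

# u3 = x1^2 + 2*x2^2 - x3^2 + 4*x2*x3, written first with a non-symmetric matrix
# that puts the whole cross-product coefficient 4 in position (2, 3).
A_upper = [[1, 0, 0],
           [0, 2, 4],
           [0, 0, -1]]
B = symmetrize(A_upper)  # splits the coefficient 4 equally between (2,3) and (3,2)

x = [1.5, -2.0, 0.5]
u3_direct = x[0] ** 2 + 2 * x[1] ** 2 - x[2] ** 2 + 4 * x[1] * x[2]
print(B)
print(u3_direct, quad_form(A_upper, x), quad_form(B, x))
```

Both matrices give the same value of the quadratic form, and `B` coincides with the symmetric matrix written in Solution 10.10.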


When we have a sample from a normal population, we can derive many interesting and useful results. For a full discussion of quadratic forms and bilinear forms in real random variables, see the books [12] and [13]. We will need only two main results on quadratic forms: one is the chi-squaredness of quadratic forms and the other is the independence of two quadratic forms.


Result 10.13 (Chi-squaredness of quadratic forms). Let $x_1,\dots,x_p$ be iid random variables from a normal population $N(0,\sigma^2)$. Let $u=X'AX$ be a quadratic form, where $X'=(x_1,\dots,x_p)$ and $A=(a_{ij})=A'$ is a matrix of constants. Then the necessary and sufficient condition for $\frac{u}{\sigma^2}$ to be chi-square distributed with $r$ degrees of freedom, or $\frac{u}{\sigma^2}\sim\chi^2_r$, is that $A$ is idempotent and of rank $r$.


Proof. For any real symmetric matrix $A$, there exists an orthonormal matrix $Q$, $QQ'=I$, $Q'Q=I$, such that $Q'AQ=D=\mathrm{diag}(\lambda_1,\dots,\lambda_p)$ where $\lambda_1,\dots,\lambda_p$ are the eigenvalues of $A$. Hence by making the transformation $Y=Q'X$ we have

$$X'AX=Y'DY=\lambda_1y_1^2+\cdots+\lambda_py_p^2. \tag{10.52}$$

Also the orthonormal transformation, being linear in $X$, will still have $y_j$, $j=1,\dots,p$ independently distributed as $N(0,\sigma^2)$. Hence $\frac{y_j^2}{\sigma^2}\sim\chi^2_1$, $j=1,\dots,p$. If $A$ is idempotent of rank $r$ then $r$ of the $\lambda_j$'s are unities and the remaining ones are zeros, and hence $\frac{u}{\sigma^2}=\frac{1}{\sigma^2}(y_1^2+\cdots+y_r^2)\sim\chi^2_r$. Thus, if $A$ is idempotent of rank $r$ then $\frac{u}{\sigma^2}\sim\chi^2_r$. For proving the converse, we assume that $\frac{X'AX}{\sigma^2}\sim\chi^2_r$ and show that then $A$ is idempotent and of rank $r$. The mgf of a chi-square with $r$ degrees of freedom is $(1-2t)^{-\frac{r}{2}}$ for $1-2t>0$. But each $\frac{y_j^2}{\sigma^2}$ is $\chi^2_1$, with mgf $(1-2t)^{-\frac{1}{2}}$ with $1-2t>0$, for $j=1,\dots,p$, and these are mutually independently distributed. Further, $\lambda_jy_j^2/\sigma^2$ has the mgf $(1-2\lambda_jt)^{-\frac{1}{2}}$ with $1-2\lambda_jt>0$, $j=1,\dots,p$. Then from (10.52) and the $\chi^2_r$, we have the identity

$$\prod_{j=1}^{p}(1-2\lambda_jt)^{-\frac{1}{2}}=(1-2t)^{-\frac{r}{2}}. \tag{10.53}$$

Taking the natural logarithm on both sides of (10.53), expanding, and equating the coefficients of $(2t),(2t)^2,\dots$ we obtain the following:

$$\sum_{j=1}^{p}\lambda_j^m=r,\quad m=1,2,\dots$$

The only solution for the above sequence of equations is that $r$ of the $\lambda_j$'s are unities and the remaining ones are zeros. This condition, together with the property that our matrix $A$ is real symmetric, will guarantee that $A$ is idempotent of rank $r$. This establishes the result.


Note 10.10. Observe that if a matrix has eigenvalues 1’s and zeros that does not mean 
that the matrix is idempotent. For example, take a triangular matrix with diagonal 
elements zeros and ones. But this property, together with real symmetry will guarantee 
idempotency. 




Note 10.11. Result 10.13 can be extended to a dependent case also. When $x_1,\dots,x_p$ have a joint normal distribution $X\sim N_p(O,\Sigma)$, $\Sigma=\Sigma'>0$, $X'=(x_1,\dots,x_p)$, a corresponding result can be obtained. Make the transformation $Y=\Sigma^{-\frac{1}{2}}X$; then the problem will reduce to the situation in Result 10.13. If $X\sim N_p(\mu,\Sigma)$, $\mu\neq O$, then also a corresponding result can be obtained but in this case the chi-square will be a non-central chi-square. Even if $X$ is a singular normal, that is, $|\Sigma|=0$, then also a corresponding result can be obtained. For such details, see [12].


As a consequence of Result 10.13, we have the following result. 


Result 10.14. Let $x_1,\dots,x_n$ be iid variables distributed as $N(\mu,\sigma^2)$. Let

$$u=\sum_{j=1}^{n}\frac{(x_j-\bar{x})^2}{\sigma^2}=\sum_{j=1}^{n}\frac{[(x_j-\mu)-(\bar{x}-\mu)]^2}{\sigma^2}=\sum_{j=1}^{n}(y_j-\bar{y})^2$$

where $y_j=\frac{x_j-\mu}{\sigma}$. Then

$$u=\sum_{j=1}^{n}(y_j-\bar{y})^2,\quad y_j\sim N(0,1) \tag{10.54}$$

and

$$u\sim\chi^2_{n-1}. \tag{10.55}$$


Proof. Writing $x_j-\bar{x}=(x_j-\mu)-(\bar{x}-\mu)$ and then taking $y_j=\frac{x_j-\mu}{\sigma}$, we have the representation in (10.54). But (10.54) is a quadratic form of the type $Y'AY$ where $A=I-\frac{1}{n}LL'$, $L'=(1,1,\dots,1)$, which is idempotent of rank $n-1$. Then from Result 10.13 the result follows.
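The two properties of $A=I-\frac{1}{n}LL'$ used in the proof can be checked numerically for a small $n$. This is only a sketch under stated assumptions: $n=6$, the vector `y` and the helper names are chosen for illustration, and the rank is read off from the trace, using the standard fact that the rank of an idempotent matrix equals its trace.

```python
def centering_matrix(n):
    """Return A = I - (1/n) L L' with L' = (1, ..., 1)."""
    return [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)] for i in range(n)]

def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    inner, cols = len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(len(A))]

n = 6
A = centering_matrix(n)
A2 = matmul(A, A)
max_diff = max(abs(A2[i][j] - A[i][j]) for i in range(n) for j in range(n))
trace = sum(A[i][i] for i in range(n))   # equals the rank for an idempotent matrix

# Y'AY should reproduce the sum of squared deviations from the mean.
y = [0.3, -1.2, 2.0, 0.7, -0.4, 1.1]
ybar = sum(y) / n
direct = sum((v - ybar) ** 2 for v in y)
via_matrix = sum(y[i] * A[i][j] * y[j] for i in range(n) for j in range(n))
print(max_diff, trace, direct, via_matrix)
```

The printed difference between $A^2$ and $A$ is at rounding level, and the trace equals $n-1$.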


Another basic result, which is needed for testing hypotheses in model building 
situations, design of experiments, analysis of variance, regression problems, etc. is 
the result on independence of quadratic forms. This will be stated next. 


Result 10.15 (Independence of two quadratic forms). Let $x_1,\dots,x_p$ be iid variables following a $N(0,\sigma^2)$ distribution. [This is also the same as saying $X\sim N_p(O,\sigma^2I)$, $X'=(x_1,\dots,x_p)$ where $I$ is the identity matrix and $\sigma^2>0$ is a scalar quantity.] Let $u=X'AX$, $A=A'$ and $v=X'BX$, $B=B'$ be two quadratic forms in $X$. Then these quadratic forms $u$ and $v$ are independently distributed if and only if (iff) $AB=O$.


Proof. Since $A$ and $B$ are real symmetric matrices, from Result 10.13 we have the representations

$$u=X'AX=\lambda_1y_1^2+\cdots+\lambda_py_p^2 \tag{10.56}$$

and

$$v=X'BX=\nu_1y_1^2+\cdots+\nu_py_p^2 \tag{10.57}$$

where $\lambda_1,\dots,\lambda_p$ are the eigenvalues of $A$; $\nu_1,\dots,\nu_p$ are the eigenvalues of $B$, and $y_j\sim N(0,\sigma^2)$, $j=1,\dots,p$, mutually independently distributed. Let us assume that $AB=O$. Then due to symmetry, we have

$$AB=O=O'=(AB)'=B'A'=BA\ \Rightarrow\ AB=BA$$

which means that $A$ and $B$ commute. This commutativity and symmetry will guarantee that there exists an orthonormal matrix $Q$, $QQ'=I$, $Q'Q=I$ such that both $A$ and $B$ are reduced to their diagonal forms by the same $Q$. That is,

$$O=AB\ \Rightarrow\ Q'AQ\,Q'BQ=D_1D_2,$$

$$D_1=\mathrm{diag}(\lambda_1,\dots,\lambda_p),\quad D_2=\mathrm{diag}(\nu_1,\dots,\nu_p).$$

But $D_1D_2=O$ means that whenever a $\lambda_j\neq 0$ the corresponding $\nu_j=0$ and vice versa. In other words, all terms in $u$ and $v$ are mathematically separated. Once a set of statistically independent variables is mathematically separated into two sets, the two sets are statistically independent also. Hence $u$ and $v$ are independently distributed. This is the sufficiency part of the proof. For proving the converse, the "necessary" part, we assume that $u$ and $v$ are independently distributed. We can use this property and the representations in (10.56) and (10.57). By retracing the steps in the "sufficiency" part, we cannot prove the "necessary" part. There are many incorrect proofs of this part in the literature. The correct proof is a little more lengthy and makes use of a number of properties of matrices, and hence we will not give it here. The students may refer to Mathai and Provost [12].


Remark 10.2. One consequence of the above result with respect to a simple random sample from a normal population is the following: Let $x_1,\dots,x_n$ be iid $N(\mu,\sigma^2)$ variables. Let $u=\frac{1}{\sigma^2}\sum_{j=1}^{n}(x_j-\bar{x})^2$ and $v=\frac{n}{\sigma^2}(\bar{x}-\mu)^2$. Taking $y_j=\frac{x_j-\mu}{\sigma}$, we have $y_j\sim N(0,1)$, $j=1,\dots,n$ and iid. Then

$$u=\sum_{j=1}^{n}(y_j-\bar{y})^2=Y'AY,\quad A=I-\frac{1}{n}LL',\quad L'=(1,\dots,1)$$

$$v=n(\bar{y})^2=Y'BY,\quad B=\frac{1}{n}LL',\quad Y'=(y_1,\dots,y_n).$$

Observe that $AB=O\Rightarrow u$ and $v$ are independently distributed, thereby one has the independence of the sample variance and the square of the sample mean when the sample comes from a $N(\mu,\sigma^2)$ population. One can extend this result to the independence of the sample variance and the sample mean when the sample is from a $N(\mu,\sigma^2)$ population.
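A simulation gives a feel for this independence and for how it can fail outside the normal family. The sketch below makes several assumptions purely for illustration: samples of size 5, a $N(2,3^2)$ population, an exponential population with rate 1 for contrast, and the helper names shown. A sample correlation near zero is consistent with independence but does not establish it.

```python
import math
import random

def correlation(a, b):
    """Sample correlation coefficient of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def mean_variance_correlation(draw, n, repetitions, seed):
    """Correlation between the sample mean and the sample variance over many samples."""
    rng = random.Random(seed)
    means, variances = [], []
    for _ in range(repetitions):
        sample = [draw(rng) for _ in range(n)]
        xbar = sum(sample) / n
        means.append(xbar)
        variances.append(sum((x - xbar) ** 2 for x in sample) / n)
    return correlation(means, variances)

corr_normal = mean_variance_correlation(lambda rng: rng.gauss(2.0, 3.0), 5, 20000, seed=7)
corr_expo = mean_variance_correlation(lambda rng: rng.expovariate(1.0), 5, 20000, seed=7)
print(round(corr_normal, 3), round(corr_expo, 3))
```

For the normal population the correlation should be close to zero, whereas for the skewed exponential population the sample mean and sample variance are clearly positively correlated.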




Exercises 10.5 


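10.5.1. Let the $p\times 1$ vector $X$ have a mean value vector $\mu$ and positive definite covariance matrix $\Sigma$, that is, $E(X)=\mu$, $\mathrm{Cov}(X)=\Sigma=\Sigma'>0$. Show that $Y=\Sigma^{-\frac{1}{2}}X\Rightarrow E(Y)=\Sigma^{-\frac{1}{2}}\mu$, $\mathrm{Cov}(Y)=I$ and for $Z=Y-\Sigma^{-\frac{1}{2}}\mu$, $E(Z)=O$, $\mathrm{Cov}(Z)=I$.


10.5.2. Consider the quadratic form $Q(X)=X'AX$ and $Y$ and $Z$ as defined in Exercise 10.5.1. Then show that

$$Q(X)=X'AX=Y'\Sigma^{\frac{1}{2}}A\Sigma^{\frac{1}{2}}Y=(Z+\Sigma^{-\frac{1}{2}}\mu)'\Sigma^{\frac{1}{2}}A\Sigma^{\frac{1}{2}}(Z+\Sigma^{-\frac{1}{2}}\mu). \tag{10.58}$$


10.5.3. Let $P$ be an orthonormal matrix which will diagonalize the symmetric matrix of Exercise 10.5.2, $\Sigma^{\frac{1}{2}}A\Sigma^{\frac{1}{2}}$, into the form

$$P'\Sigma^{\frac{1}{2}}A\Sigma^{\frac{1}{2}}P=\mathrm{diag}(\lambda_1,\dots,\lambda_p),\quad PP'=I,\quad P'P=I$$

where $\lambda_1,\dots,\lambda_p$ are the eigenvalues of $\Sigma^{\frac{1}{2}}A\Sigma^{\frac{1}{2}}$. Then show that $Q(X)$, the quadratic form, has the following representations:

$$Q(X)=X'AX=\sum_{j=1}^{p}\lambda_j(u_j+b_j)^2,\quad A=A',\ \mu\neq O$$

$$=\sum_{j=1}^{p}\lambda_ju_j^2,\quad A=A',\ \mu=O \tag{10.59}$$

where $U=P'Z$, $U'=(u_1,\dots,u_p)$, and $b'=(b_1,\dots,b_p)=\mu'\Sigma^{-\frac{1}{2}}P$.

10.5.4. Illustrate the representation in Exercise 10.5.3 for $Q(X)=2x_1^2+3x_2^2-2x_1x_2$ and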
A=). 


10.5.5. Singular case. Let the $p\times 1$ vector $X$ have the mean value $E(X)=\mu$, $\mathrm{Cov}(X)=\Sigma$ of rank $r\le p$. Since $\Sigma$ here is at least positive semi-definite, we have a representation $\Sigma=BB'$ where $B$ is $p\times r$ of rank $r$. Then one can write $X=\mu+BY$ with $E(Y)=O$ and $\mathrm{Cov}(Y)=I$. Show that any quadratic form $Q(X)=X'AX$ has the representation

$$Q(X)=X'AX=(\mu+BY)'A(\mu+BY)=\mu'A\mu+2Y'B'A\mu+Y'B'ABY\quad\text{for }A=A'. \tag{10.60}$$

Obtain a representation for $Q(X)$, corresponding to the one in Exercise 10.5.3, for the singular case.


10.5.6. Let the $p\times 1$ vector $X$, $X'=(x_1,\dots,x_p)$, be distributed as a multivariate normal of the following type:

$$f(X)=\frac{1}{|\Sigma|^{\frac{1}{2}}(2\pi)^{\frac{p}{2}}}\,e^{-\frac{1}{2}(X-\mu)'\Sigma^{-1}(X-\mu)}$$

where $-\infty<x_j<\infty$, $-\infty<\mu_j<\infty$, $\Sigma>0$, $\mu'=(\mu_1,\dots,\mu_p)$. Show that for this non-singular normal the mgf is given by

$$M_X(T)=e^{T'\mu+\frac{1}{2}T'\Sigma T} \tag{10.61}$$

where $T'=(t_1,\dots,t_p)$ is a parametric vector. (Hint: $M_X(T)=E[e^{T'X}]$.)


10.5.7. Taking the mgf in Exercise 10.5.6 as the mgf for both the non-singular case $\Sigma>0$ and the singular case $\Sigma\ge 0$, show by calculating the mgf, or otherwise, that an arbitrary linear function $y=a'X$, $a'=(a_1,\dots,a_p)$, $X'=(x_1,\dots,x_p)$ has a univariate normal distribution.


10.5.8. If an arbitrary linear function $y=a'X$, $a'=(a_1,\dots,a_p)$, $X'=(x_1,\dots,x_p)$, has a univariate normal distribution, for all constant vectors $a$, then show that the $p\times 1$ vector $X$ has a multivariate normal distribution of the type determined by (10.61) and, in the non-singular case, has the density as in Exercise 10.5.6.


10.5.9. Let the $p\times 1$ vector $X$ have a singular normal distribution (which also includes the non-singular case). Let the covariance matrix $\Sigma$ of $X$ be such that $\Sigma=BB'$ where $B$ is a $p\times q$, $q\ge p$ matrix of rank $r\le p$. Let $Q(X)=X'AX$. Show that the mgf of $Q=Q(X)$ has the representation

$$M_Q(t)=\Big[\prod_{j=1}^{r}(1-2t\lambda_j)^{-\frac{1}{2}}\Big]\exp\Big\{\alpha t+2t^2\sum_{j=1}^{r}b_j^2(1-2t\lambda_j)^{-1}\Big\},\quad \mu\neq O$$

$$=\prod_{j=1}^{r}(1-2t\lambda_j)^{-\frac{1}{2}},\quad \mu=O \tag{10.62}$$

where the $\lambda_j$'s are the eigenvalues of $B'AB$, $b'=(b_1,\dots,b_r)=\mu'A'BP$, $\alpha=\mu'A\mu$.


10.5.10. Let $X\sim N_p(O,I)$ and $X'X=X'A_1X+X'A_2X$, $A_1=A_1'$, $A_2=A_2'$, where $A_1$ is idempotent of rank $r<p$. Then show that $X'A_1X\sim\chi^2_r$, $X'A_2X\sim\chi^2_{p-r}$ and that $X'A_1X$ and $X'A_2X$ are independently distributed. [This result can also be extended when we have the representation $I=A_1+\cdots+A_k$ and this will help to split the total variation into a sum of individual variations due to different components in practical situations such as analysis of variance problems.]


10.6 Order statistics 


In a large number of practical situations, the items of interest may be largest value or 
the smallest value of a set of observations. If we are watching the flood in the local 
river, then the daily water level in the river is not that important but the highest water 
level is most important or water levels over a threshold value are all important. If you 
are watching the grain storage in a silo or water storage in a water reservoir serving a 
city, over the years, then both the highest level and lowest levels are very important. 




If you are running an insurance firm then the maximum damage due to vehicular col- 
lision, largest number of accidents, largest number of thefts of properties are all very 
important. The theory of order statistics deals with such largest values or smallest val- 
ues or the r-th largest values, etc. Since numbers are simply numbers and there is not 
much to study there, we will be studying some random variables corresponding to 
such ordered observations. 

Let $x_1,\dots,x_n$ be a simple random sample of size $n$ from some population. Then $(x_1,\dots,x_n)$ is a collection of random variables. Consider one set of observations on $(x_1,\dots,x_n)$. For example, let $x$ be the waiting time for a particular bus at a local bus stop. Assume that this bus never comes earlier than the scheduled time but it can only be on time or late. On 5 (here $n=5$) randomly selected occasions let the waiting times be 3, 10, 15, 0, 2, time being measured in minutes. If we write these observations in ascending order of their magnitudes, then we have

$$0<2<3<10<15 \tag{i}$$

Again, suppose that another 5 (same $n=5$) occasions are checked. The waiting times may be 3, 5, 8, 5, 10. If we order these observations, then we have

$$3\le 5\le 5\le 8\le 10 \tag{ii}$$

If we keep on taking such sets of 5 observations, then each such set of 5 observations can be ordered as in (i) and (ii). There will be a set of observations which will be the smallest in each set, a set of observations which will be the second smallest and so on, and finally there will be a set of observations corresponding to the largest in each set. Now think of the set of smallest observations as coming from a random variable denoted by $x_{n:1}$, the set of second smallest numbers as coming from a random variable $x_{n:2}$, etc., and finally the set of largest observations as coming from the random variable $x_{n:n}$. Thus, symbolically we may write

$$x_{n:1}\le x_{n:2}\le\cdots\le x_{n:n}. \tag{10.63}$$


Since these are random variables, defined on the whole real line $(-\infty,\infty)$, there is no
meaning to the statement that the variables are ordered or that one variable is less than
another variable. What it means is that if we have an observed sample, then the n
observations can be ordered. Once they are ordered, the smallest will be the observation
on $x_{n:1}$, the second smallest will be the observation on $x_{n:2}$ and so on, and
the largest will be the observation on $x_{n:n}$. From the ordering in (i) and (ii) note that the
number 3 is the smallest in (ii) whereas it is the 3rd smallest in (i). Thus, for example,
every observation on $x_{n:1}$ need not be smaller than every observation on $x_{n:2}$, but for
every observed sample of size n we have one observation each corresponding to $x_{n:r}$
for $r = 1,2,\ldots,n$. Now, we will have some formal definitions.
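The ordering of an observed sample can be illustrated with a short Python sketch. The two lists are the waiting times whose orderings are shown in (i) and (ii); the function name `order_statistics` is only a label chosen for this illustration.

```python
# Two observed samples of size n = 5 (waiting times in minutes),
# taken to be the ones whose orderings are displayed in (i) and (ii).
sample_i = [3, 10, 15, 0, 2]
sample_ii = [3, 5, 8, 5, 10]


def order_statistics(sample):
    """Return the observed values of x_{n:1}, ..., x_{n:n} for one sample."""
    return sorted(sample)


ordered_i = order_statistics(sample_i)
ordered_ii = order_statistics(sample_ii)

# The value 3 is an observation on x_{5:3} in (i) but on x_{5:1} in (ii).
rank_in_i = ordered_i.index(3) + 1
rank_in_ii = ordered_ii.index(3) + 1
print(ordered_i, ordered_ii, rank_in_i, rank_in_ii)
```

The same number can therefore be an observation on different order statistics in different samples, which is why the ordering is a statement about each observed sample and not about the random variables themselves.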

Let us consider a continuous population. Let the iid variables $x_1,\ldots,x_n$ come from
a population with density function f(x) and distribution function F(x) (cumulative
density). How can we compute the density function or distribution function of $x_{n:r}$,
the r-th smallest variable or the r-th order statistic? For example, how can we compute
the density of the smallest order statistic?


10.6.1 Density of the smallest order statistic $x_{n:1}$


We are considering continuous random variables here. We may use the argument that
if the smallest is bigger than a number y then all observations on the variables $x_1,\ldots,x_n$
must be bigger than y. Since the variables are iid, the required probability will be a
product. Therefore,

$$\Pr\{x_{n:1} > y\} = \Pr\{x_1 > y\}\Pr\{x_2 > y\}\cdots\Pr\{x_n > y\} = [\Pr\{x_j > y\}]^n$$

since the variables are iid. But

$$\Pr\{x_j > y\} = 1 - \Pr\{x_j \le y\} = 1 - F(y) \tag{10.64}$$

where F(y) is the distribution function of x evaluated at the point y. But the left side
of the first equation, $\Pr\{x_{n:1} > y\}$, is 1 minus the distribution function of $x_{n:1}$,
denoted by $1 - F_{(1)}(y)$. Therefore, the density function of $x_{n:1}$, denoted by $f_{(1)}(y)$,
is given by

$$f_{(1)}(x_{n:1}) = -\frac{d}{dy}\big[1 - F_{(1)}(y)\big]\Big|_{y=x_{n:1}} = -\frac{d}{dy}\big[1 - F(y)\big]^n\Big|_{y=x_{n:1}}$$

$$f_{(1)}(x_{n:1}) = n[1 - F(x_{n:1})]^{n-1} f(x_{n:1}),\quad -\infty < x_{n:1} < \infty. \tag{10.65}$$

Here, $f(x_{n:1})$ indicates the population density evaluated at the observed point of $x_{n:1}$
and $F(x_{n:1})$ means the population distribution function evaluated at the observed $x_{n:1}$.
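As a numerical sanity check of the density of the smallest order statistic, the following sketch evaluates $n[1-F(y)]^{n-1}f(y)$ for a uniform population over [0,1] and integrates it by the midpoint rule. The choice of population, the value n = 5 and the helper names are assumptions made only for this illustration.

```python
def smallest_density(f, F, n, y):
    """Density of the smallest order statistic at y: n [1 - F(y)]^(n-1) f(y)."""
    return n * (1.0 - F(y)) ** (n - 1) * f(y)


def midpoint_integral(g, a, b, steps=20000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / steps
    return h * sum(g(a + (k + 0.5) * h) for k in range(steps))


# Assumed population for the illustration: uniform over [0, 1].
def f_unif(x):
    return 1.0 if 0.0 <= x <= 1.0 else 0.0


def F_unif(x):
    return min(max(x, 0.0), 1.0)


n = 5
density = lambda y: smallest_density(f_unif, F_unif, n, y)
total = midpoint_integral(density, 0.0, 1.0)  # total probability
tail = midpoint_integral(density, 0.3, 1.0)   # Pr{x_{n:1} > 0.3}
print(total, tail, (1.0 - 0.3) ** n)
```

The total integral should be close to 1, and the tail integral should agree with $[1 - F(0.3)]^n$, which is the product argument used in the derivation.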


10.6.2 Density of the largest order statistic $x_{n:n}$


Again we are considering continuous random variables. Here, we may use the argument
that if the largest of the observations is less than or equal to y then every observation
must be $\le y$. This statement, translated in terms of the random variables, is the
following:

$$\Pr\{x_{n:n} \le y\} = \Pr\{x_1 \le y\}\cdots\Pr\{x_n \le y\} = [\Pr\{x_j \le y\}]^n = [F(y)]^n.$$

Hence the density of $x_{n:n}$, denoted by $f_{(n)}(\cdot)$, is given by

$$f_{(n)}(x_{n:n}) = \frac{d}{dy}\Pr\{x_{n:n} \le y\}\Big|_{y=x_{n:n}} = \frac{d}{dy}[F(y)]^n\Big|_{y=x_{n:n}} = n[F(x_{n:n})]^{n-1} f(x_{n:n}). \tag{10.66}$$
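The relation $\Pr\{x_{n:n} \le y\} = [F(y)]^n$ can be checked by simulation. The sketch below assumes an exponential population; the values of theta, n, y, the seed and the number of repetitions are arbitrary choices for the illustration and do not come from the text.

```python
import math
import random


def largest_cdf(F, n, y):
    """Pr{x_{n:n} <= y} = [F(y)]^n for an iid sample of size n."""
    return F(y) ** n


theta, n, y = 2.0, 5, 3.0  # illustrative values


def F_exp(x):
    """Distribution function of an exponential population with mean theta."""
    return 1.0 - math.exp(-x / theta) if x >= 0 else 0.0


rng = random.Random(12345)  # fixed seed so that the run is reproducible
reps = 20000
hits = 0
for _ in range(reps):
    # random.expovariate takes the rate, which is 1/theta here.
    sample = [rng.expovariate(1.0 / theta) for _ in range(n)]
    if max(sample) <= y:
        hits += 1

empirical = hits / reps
theoretical = largest_cdf(F_exp, n, y)
print(empirical, theoretical)
```

The relative frequency of samples whose largest value is at most y should be close to $[F(y)]^n$, up to simulation error.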




10.6.3 The density of the r-th order statistic $x_{n:r}$


Here also one can use an argument similar to the one in Sections 10.6.1 and 10.6.2.
But it will be easier to use the following argument. Think of the subdivision of the
x-axis into the following intervals: $(-\infty, x_{n:r})$, $(x_{n:r}, x_{n:r} + \Delta x_{n:r})$,
$(x_{n:r} + \Delta x_{n:r}, \infty)$, where $\Delta x_{n:r}$ is a small increment in $x_{n:r}$. When we say
that an observation is the r-th smallest, that means that r - 1 observations are below it,
in the interval $(-\infty, x_{n:r})$, one is in the interval $(x_{n:r}, x_{n:r} + \Delta x_{n:r})$ and n - r
observations are in the interval $(x_{n:r} + \Delta x_{n:r}, \infty)$. Let $p_1$, $p_2$
and $p_3$ be the respective probabilities. These probabilities can be computed from the
population density. Note that $p_i > 0$, i = 1,2,3 and $p_1 + p_2 + p_3 = 1$ because the three
intervals cover the whole real line. Then the density of $x_{n:r}$, denoted
by $f_{(r)}(x_{n:r})$, is given by the following multinomial probability law:

$$f_{(r)}(x_{n:r})\,dx_{n:r} = \lim_{\Delta x_{n:r}\to 0} \frac{n!}{(r-1)!\,1!\,(n-r)!}\, p_1^{r-1} p_2^{1} p_3^{n-r}.$$

But

$$p_1 = \Pr\{-\infty < x_j \le x_{n:r}\} = F(x_{n:r})$$

$$\lim_{\Delta x_{n:r}\to 0} p_2 = \lim_{\Delta x_{n:r}\to 0} \Pr\{x_{n:r} \le x_j \le x_{n:r} + \Delta x_{n:r}\} = f(x_{n:r})\,dx_{n:r}$$

$$\lim_{\Delta x_{n:r}\to 0} p_3 = \lim_{\Delta x_{n:r}\to 0} \Pr\{x_{n:r} + \Delta x_{n:r} \le x_j < \infty\} = 1 - F(x_{n:r}).$$

Substituting these values, we get

$$f_{(r)}(x_{n:r}) = \frac{n!}{(r-1)!\,(n-r)!}[F(x_{n:r})]^{r-1}[1 - F(x_{n:r})]^{n-r} f(x_{n:r})$$

$$= \frac{\Gamma(n+1)}{\Gamma(r)\Gamma(n-r+1)}[F(x_{n:r})]^{r-1}[1 - F(x_{n:r})]^{n-r} f(x_{n:r}). \tag{10.67}$$

Note that $f_{(r)}(\cdot)$ is the density of the r-th order statistic, $f(x_{n:r})$ is the population density
evaluated at $x_{n:r}$ and $F(x_{n:r})$ is the population distribution function evaluated at $x_{n:r}$.
Note that the above procedure is the most convenient one when we want to evaluate
the joint density of any number of order statistics, that is, divide the real line into
intervals accordingly and then use the multinomial probability law to evaluate the
joint density.
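A minimal sketch of the r-th order statistic density follows, with a check that r = 1 and r = n reproduce the densities of the smallest and largest order statistics. The exponential population, the value of theta and the evaluation point are assumptions for the illustration only.

```python
import math


def order_stat_density(f, F, n, r, y):
    """Density of the r-th order statistic at y:
    n!/((r-1)!(n-r)!) [F(y)]^(r-1) [1 - F(y)]^(n-r) f(y)."""
    coeff = math.factorial(n) / (math.factorial(r - 1) * math.factorial(n - r))
    return coeff * F(y) ** (r - 1) * (1.0 - F(y)) ** (n - r) * f(y)


theta = 1.5  # assumed mean of the exponential population


def f_exp(x):
    return math.exp(-x / theta) / theta if x >= 0 else 0.0


def F_exp(x):
    return 1.0 - math.exp(-x / theta) if x >= 0 else 0.0


n, y = 6, 0.8
smallest = n * (1.0 - F_exp(y)) ** (n - 1) * f_exp(y)  # smallest order statistic
largest = n * F_exp(y) ** (n - 1) * f_exp(y)           # largest order statistic
from_general_1 = order_stat_density(f_exp, F_exp, n, 1, y)
from_general_n = order_stat_density(f_exp, F_exp, n, n, y)

# Summing the densities over r = 1, ..., n gives n f(y), by the binomial theorem.
total_over_r = sum(order_stat_density(f_exp, F_exp, n, r, y) for r in range(1, n + 1))
print(smallest, from_general_1, largest, from_general_n, total_over_r)
```

The last quantity is a convenient extra check: every observation occupies exactly one rank, so the n densities must add up to n times the population density at y.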


Example 10.11. Evaluate the densities of (1) the largest order statistic, (2) the smallest 
order statistic, (3) the r-th order statistic, when the population is (i) uniform over [0,1], 
(ii) exponential with parameter $\theta$.


Solution 10.11. (i) Let the population be uniform over [0,1]. Then the population density
and distribution function are the following:

$$f(x) = \begin{cases} 1, & 0 \le x \le 1 \\ 0, & \text{elsewhere;} \end{cases} \qquad F(x) = \begin{cases} 0, & x < 0 \\ x, & 0 \le x \le 1 \\ 1, & x \ge 1. \end{cases}$$

From (10.65), the density of the smallest order statistic is given by

$$f_{(1)}(y) = n[1 - y]^{n-1},\quad 0 \le y \le 1,\ y = x_{n:1}$$

and zero elsewhere. From (10.66), the density of the largest order statistic is given by

$$f_{(n)}(y) = n y^{n-1},\quad 0 \le y \le 1,\ y = x_{n:n}$$

and zero elsewhere. From (10.67), the density of the r-th order statistic is given by

$$f_{(r)}(y) = \frac{n!}{(r-1)!\,(n-r)!}\, y^{r-1}(1-y)^{n-r},\quad 0 \le y \le 1,\ y = x_{n:r}.$$

(ii) When the population is exponential the density and distribution function are
the following:

$$f(x) = \begin{cases} \frac{1}{\theta}e^{-x/\theta}, & x \ge 0 \\ 0, & \text{elsewhere;} \end{cases} \qquad F(x) = \begin{cases} 0, & -\infty < x < 0 \\ 1 - e^{-x/\theta}, & 0 \le x < \infty. \end{cases}$$

Hence the density for the largest order statistic is given by

$$f_{(n)}(y) = n[1 - e^{-y/\theta}]^{n-1}\,\frac{1}{\theta}e^{-y/\theta},\quad y = x_{n:n}.$$

The density for the smallest order statistic is given by

$$f_{(1)}(y) = n[e^{-y/\theta}]^{n-1}\,\frac{1}{\theta}e^{-y/\theta} = \frac{n}{\theta}e^{-ny/\theta},\quad y = x_{n:1}.$$

It is interesting to note that the density of the smallest order statistic in this case is
again an exponential density with parameter $\theta/n$ or, if the original population density
is taken as $f(x) = \theta e^{-\theta x}$, $x \ge 0$, $\theta > 0$, then the density of the smallest order statistic
is the same with $\theta$ replaced by $n\theta$. This, in fact, is a property which can be used to
characterize or uniquely determine the exponential density.
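The property that the smallest of n iid exponential variables with mean $\theta$ is again exponential with mean $\theta/n$ can be illustrated by simulation. The values of theta and n, the seed and the number of repetitions below are arbitrary choices for this sketch.

```python
import random

theta, n = 4.0, 5  # illustrative values
rng = random.Random(2024)  # fixed seed for reproducibility
reps = 20000

# Smallest order statistic of each simulated sample of size n.
# random.expovariate takes the rate 1/theta, so each draw has mean theta.
minima = [min(rng.expovariate(1.0 / theta) for _ in range(n)) for _ in range(reps)]

mean_min = sum(minima) / reps
print(mean_min, theta / n)
```

The average of the simulated minima should be close to $\theta/n$, the mean of an exponential density with parameter $\theta/n$.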


Example 10.12. A traveler taking a commuter train every morning for five days every 
week has to wait in a queue for buying the ticket. If the waiting time is exponentially 
distributed with the expected waiting time 4 minutes, then what is the probability that 
for any given week (1) the shortest waiting time is less than one minute, (2) the longest 
waiting time is more than 10 minutes, time being measured in minutes? 


Solution 10.12. From the given information, the population density is of the form:

$$f(t) = \frac{1}{4}e^{-t/4},\quad t \ge 0$$

and zero elsewhere, and hence the distribution function will be of the form
$F(x) = 1 - e^{-x/4}$ for $x \ge 0$. Then the density for the smallest order statistic is of the form:

$$f_{(1)}(y) = n[1 - F(y)]^{n-1} f(y) = \frac{5}{4}e^{-5y/4},\quad y \ge 0,\ y = x_{n:1}.$$

The probability that we need is $\Pr\{y \le 1\}$. That is,

$$\Pr\{y \le 1\} = \int_0^1 \frac{5}{4}e^{-5y/4}\,dy = 1 - e^{-5/4}.$$

Similarly, the density for the largest order statistic is

$$f_{(n)}(y) = n[F(y)]^{n-1} f(y) = 5[1 - e^{-y/4}]^4\,\frac{1}{4}e^{-y/4},\quad y \ge 0,\ y = x_{n:n}.$$

The probability that we need is $\Pr\{y \ge 10\}$. That is,

$$\Pr\{y \ge 10\} = \frac{5}{4}\int_{10}^{\infty}[1 - e^{-y/4}]^4 e^{-y/4}\,dy$$

$$= 5\int_{2.5}^{\infty}[1 - 4e^{-u} + 6e^{-2u} - 4e^{-3u} + e^{-4u}]e^{-u}\,du$$

$$= 5e^{-2.5}\Big[1 - 2e^{-2.5} + 2e^{-5} - e^{-7.5} + \frac{1}{5}e^{-10}\Big].$$
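The two probabilities in Example 10.12 can also be evaluated numerically from the distribution functions of the smallest and largest order statistics, $1 - [1 - F(y)]^n$ and $[F(y)]^n$. The sketch below does this and compares the second probability with the expanded closed form; the variable names are chosen for this illustration.

```python
import math

theta, n = 4.0, 5  # expected waiting time 4 minutes, five days a week


def F(y):
    """Distribution function of the exponential waiting time."""
    return 1.0 - math.exp(-y / theta)


# (1) shortest waiting time less than one minute: 1 - [1 - F(1)]^n
p_shortest_lt_1 = 1.0 - (1.0 - F(1.0)) ** n

# (2) longest waiting time more than 10 minutes: 1 - [F(10)]^n
p_longest_gt_10 = 1.0 - F(10.0) ** n

# Expanded form obtained by integrating the density term by term.
a = math.exp(-2.5)
closed_form = 5.0 * a * (1.0 - 2.0 * a + 2.0 * a ** 2 - a ** 3 + a ** 4 / 5.0)

print(p_shortest_lt_1, p_longest_gt_10, closed_form)
```

The values come out at roughly 0.71 for the first probability and roughly 0.35 for the second, and the two expressions for the second probability agree.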


Example 10.13. For the r-th order statistic $x_{n:r}$ show that the density can be transformed
to a type-1 beta density.


Solution 10.13. Let the population density and distribution function be denoted
by f(x) and F(x) respectively and let the density for the r-th order statistic be denoted
by $f_{(r)}(y)$, $y = x_{n:r}$. Then it is given by

$$f_{(r)}(y) = \frac{n!}{(r-1)!\,(n-r)!}[F(y)]^{r-1}[1 - F(y)]^{n-r} f(y),\quad y = x_{n:r}.$$

Let $u = F(y) \Rightarrow du = dF(y) = f(y)\,dy$. Then the density of u, denoted by g(u), is given
by

$$g(u) = \frac{n!}{(r-1)!\,(n-r)!}\,u^{r-1}(1-u)^{n-r},\quad 0 \le u \le 1$$

$$= \frac{\Gamma(n+1)}{\Gamma(r)\Gamma(n+1-r)}\,u^{r-1}(1-u)^{n+1-r-1},\quad 0 \le u \le 1$$

and zero elsewhere, which is a type-1 beta density with the parameters r and n + 1 - r.
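The transformation u = F(y) can be illustrated by simulation: whatever the continuous population, $F(x_{n:r})$ should behave like a type-1 beta variable with parameters r and n + 1 - r, whose mean is r/(n + 1). The exponential population and the values of theta, n and r below are assumptions made only for this sketch.

```python
import math
import random

theta, n, r = 2.0, 7, 3  # illustrative values
rng = random.Random(7)   # fixed seed for reproducibility
reps = 20000

u_values = []
for _ in range(reps):
    sample = sorted(rng.expovariate(1.0 / theta) for _ in range(n))
    y = sample[r - 1]                             # observation on x_{n:r}
    u_values.append(1.0 - math.exp(-y / theta))   # u = F(y)

mean_u = sum(u_values) / reps
beta_mean = r / (n + 1)  # mean of a type-1 beta with parameters (r, n + 1 - r)
print(mean_u, beta_mean)
```

Only the first moment is compared here; a fuller check would compare the empirical distribution of u with the beta distribution function.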


Remark 10.3. From the structure of the density function of the r-th order statistic,
it may be noted that the distribution function F(x) (cumulative density) of the original
population is involved. Hence if the population is gamma, generalized gamma,
Rayleigh, Maxwell-Boltzmann, general pathway density, etc. then F(x) may not have a
simple explicit analytic form or it may go into incomplete gamma functions, and then
the explicit evaluation of the moments, etc. of the r-th order statistic for $r = 1,2,\ldots,n$
may not be possible. Even if the population is standard normal, the distribution
function F(x) is still not available analytically. One may have to evaluate it in terms of
the incomplete gamma function or go for numerical evaluations. But the transformation
u = F(x), as in Example 10.13, will reduce the density to a type-1 beta density, or
to a type-1 Dirichlet density in the joint distribution of several order statistics.
Hence properties of order statistics can be studied easily in all situations where one
can write $x = F^{-1}(u)$ explicitly. This is possible in some cases, for example, for uniform
and exponential situations. If x is uniform over [a, b], then

$$F(x) = \frac{x-a}{b-a} \quad\text{and}\quad u = F(x) \ \Rightarrow\ x = a + (b-a)u. \tag{10.68}$$

If x is exponential with parameter $\theta$, then

$$F(x) = 1 - e^{-x/\theta} \quad\text{and}\quad u = F(x) \ \Rightarrow\ x = -\theta\ln(1-u). \tag{10.69}$$
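When $x = F^{-1}(u)$ is available explicitly, an observation on the r-th order statistic can be generated directly, without sorting a sample: draw u from a type-1 beta density with parameters r and n + 1 - r and map it through $F^{-1}$. The sketch below does this for the uniform case, using $x = a + (b-a)u$, the inverse of $F(x) = (x-a)/(b-a)$, and for the exponential case, using $x = -\theta\ln(1-u)$. The function names and numerical values are hypothetical choices for the illustration.

```python
import math
import random


def sample_order_stat_exponential(rng, n, r, theta):
    """One draw of x_{n:r} for an exponential population with mean theta."""
    u = rng.betavariate(r, n + 1 - r)  # u = F(x_{n:r}) is type-1 beta
    return -theta * math.log(1.0 - u)  # x = F^{-1}(u)


def sample_order_stat_uniform(rng, n, r, a, b):
    """One draw of x_{n:r} for a uniform population over [a, b]."""
    u = rng.betavariate(r, n + 1 - r)
    return a + (b - a) * u


rng = random.Random(99)  # fixed seed for reproducibility
theta, n = 3.0, 4
reps = 20000

# For r = 1 the draws should behave like an exponential with mean theta/n.
draws = [sample_order_stat_exponential(rng, n, 1, theta) for _ in range(reps)]
mean_smallest = sum(draws) / reps

x_uniform = sample_order_stat_uniform(rng, 4, 2, 10.0, 20.0)
print(mean_smallest, theta / n, x_uniform)
```

This route is convenient when n is large and only one or two order statistics are needed, since it avoids generating and sorting the whole sample.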


Example 10.14. Evaluate the h-th moment of the r-th order statistic coming from a 
sample of size n and from a uniform population over [0, 1]. 


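Solution 10.14. When the population is uniform over [0,1], then the distribution
function

$$F(x) = \int_{-\infty}^{x} f(t)\,dt = \int_0^x dt = x,\quad 0 \le x \le 1.$$

Hence the cumulative density at the point $x_{n:r}$ is $F(x_{n:r}) = x_{n:r}$, and the density for $x_{n:r}$ is
given by

$$f_{(r)}(y) = \frac{\Gamma(n+1)}{\Gamma(r)\Gamma(n+1-r)}\,y^{r-1}(1-y)^{n+1-r-1},\quad 0 \le y \le 1,\ y = x_{n:r}$$

and zero elsewhere. Hence the h-th moment of the r-th order statistic $x_{n:r}$ is given by

$$E[x_{n:r}^h] = E[y^h] = \int_0^1 y^h f_{(r)}(y)\,dy$$

$$= \frac{\Gamma(n+1)}{\Gamma(r)\Gamma(n+1-r)}\int_0^1 y^{r+h-1}(1-y)^{n+1-r-1}\,dy$$

$$= \frac{\Gamma(n+1)}{\Gamma(r)\Gamma(n+1-r)}\,\frac{\Gamma(r+h)\Gamma(n+1-r)}{\Gamma(n+1+h)}$$

$$= \frac{\Gamma(n+1)}{\Gamma(n+1+h)}\,\frac{\Gamma(r+h)}{\Gamma(r)},\quad \Re(h) > -r$$

or h > -r if h is real.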
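The moment formula is easy to evaluate with the gamma function from the Python standard library. The sketch below computes the first two moments for one assumed pair (n, r) and compares them with the known mean r/(n + 1) and second moment r(r + 1)/((n + 1)(n + 2)) of the corresponding beta density; the function name is a label for this illustration.

```python
import math


def uniform_order_stat_moment(n, r, h):
    """h-th moment of x_{n:r} for a uniform [0,1] population, real h > -r:
    Gamma(n+1) Gamma(r+h) / (Gamma(n+1+h) Gamma(r))."""
    if h <= -r:
        raise ValueError("the moment exists only for h > -r")
    return math.gamma(n + 1) * math.gamma(r + h) / (math.gamma(n + 1 + h) * math.gamma(r))


n, r = 5, 2  # illustrative values
mean = uniform_order_stat_moment(n, r, 1)
second = uniform_order_stat_moment(n, r, 2)
variance = second - mean ** 2
print(mean, second, variance)
```

For large n the gamma values can overflow floating point; in that case `math.lgamma` and exponentiation of the difference of logarithms would be the safer route.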




10.6.4 Joint density of the r-th and s-th order statistics $x_{n:r}$ and $x_{n:s}$


Let us consider two order statistics, $x_{n:r}$, the r-th, and $x_{n:s}$, the s-th order statistic,
for r < s. Then divide the real axis into the intervals $(-\infty, x_{n:r})$, $(x_{n:r}, x_{n:r} + \Delta x_{n:r})$,
$(x_{n:r} + \Delta x_{n:r}, x_{n:s})$, $(x_{n:s}, x_{n:s} + \Delta x_{n:s})$, $(x_{n:s} + \Delta x_{n:s}, \infty)$. Proceed exactly as in the
derivation of the density of the r-th order statistic. Let the joint density of $x_{n:r}$ and $x_{n:s}$ be
denoted by f(y,z), $y = x_{n:r}$, $z = x_{n:s}$. Then we have

$$f(y,z)\,dy\wedge dz = \frac{n!}{(r-1)!\,1!\,(s-r-1)!\,1!\,(n-s)!}[F(y)]^{r-1}[F(z) - F(y)]^{s-r-1}[1 - F(z)]^{n-s} f(y)f(z)\,dy\wedge dz,\quad y = x_{n:r},\ z = x_{n:s} \tag{10.70}$$

for $-\infty < y < z < \infty$ and zero elsewhere. As in Example 10.13, if we make the transformation
u = F(y), v = F(z) then the joint density of u and v, denoted by g(u,v), is given
by the following:

$$g(u,v) = \frac{\Gamma(n+1)}{\Gamma(r)\Gamma(s-r)\Gamma(n+1-s)}\,u^{r-1}(v-u)^{s-r-1}(1-v)^{n-s},\quad 0 \le u < v \le 1 \tag{10.71}$$

and zero elsewhere. If we make a further transformation $u = u_1$, $v - u = u_2$ then the
joint density of u and v is changed to $g_1(u_1,u_2)$, given by

$$g_1(u_1,u_2) = \frac{\Gamma(n+1)}{\Gamma(r)\Gamma(s-r)\Gamma(n+1-s)}\,u_1^{r-1}u_2^{s-r-1}(1-u_1-u_2)^{n-s} \tag{10.72}$$

for $0 \le u_i \le 1$, i = 1,2, $0 \le u_1 + u_2 \le 1$ and $g_1(u_1,u_2) = 0$ elsewhere, which is a Dirichlet
density. If we make a further simplification $r = r_1$, $s - r = r_2$, $n + 1 - s = n + 1 - r_1 - r_2 = r_3$,
then $r_1 + r_2 + r_3 = n + 1$. Thus the density $g_1(u_1,u_2)$ becomes

$$g_1(u_1,u_2) = \frac{\Gamma(r_1+r_2+r_3)}{\Gamma(r_1)\Gamma(r_2)\Gamma(r_3)}\,u_1^{r_1-1}u_2^{r_2-1}(1-u_1-u_2)^{r_3-1} \tag{10.73}$$

for $0 \le u_i \le 1$, $0 \le u_1 + u_2 \le 1$. Then (10.73) is in the usual format of a type-1 Dirichlet
density. Now we can extend this to consider the joint density of $y_1$, the $r_1$-th, $y_2$, the
$(r_1 + r_2)$-th, ..., $y_p$, the $(r_1 + \cdots + r_p)$-th order statistics. Let the joint density be
denoted by $g(y_1,\ldots,y_p)$. Then we have

$$g(y_1,\ldots,y_p) = \frac{\Gamma(n+1)}{\Gamma(r_1)\Gamma(r_2)\cdots\Gamma(r_{p+1})}[F(y_1)]^{r_1-1}[F(y_2) - F(y_1)]^{r_2-1}\cdots[1 - F(y_p)]^{r_{p+1}-1} f(y_1)\cdots f(y_p). \tag{10.74}$$

Now, make the transformations $v_1 = F(y_1)$, $v_2 = F(y_2)$, ..., $v_p = F(y_p)$ and then make the
transformation $u_1 = v_1$, $u_2 = v_2 - v_1$, ..., $u_p = v_p - v_{p-1}$. Then the joint density of $u_1,\ldots,u_p$
will reduce to a type-1 Dirichlet density of the form:

$$g_1(u_1,\ldots,u_p) = \frac{\Gamma(n+1)}{\Gamma(r_1)\cdots\Gamma(r_{p+1})}\,u_1^{r_1-1}\cdots u_p^{r_p-1}(1 - u_1 - \cdots - u_p)^{r_{p+1}-1} \tag{10.75}$$

for $r_1 + \cdots + r_{p+1} = n + 1$, $0 \le u_j \le 1$, $j = 1,\ldots,p$, $0 \le u_1 + \cdots + u_p \le 1$.


Example 10.15. Construct the joint density of $x_{n:1}$, the smallest order statistic, and
$x_{n:n}$, the largest order statistic, for a sample of size n from a population with distribution
function F(x). Show that it is a density. Then construct the density of the range =
largest order statistic minus the smallest order statistic, when the population is uniform
over [0,1].


Solution 10.15. In the general formula in (10.70) put r = 1 and s = n to obtain the joint
density of the largest order statistic $y_2 = x_{n:n}$ and the smallest order statistic $y_1 = x_{n:1}$.
Let the joint density be denoted by $f(y_1,y_2)$. Then

$$f(y_1,y_2) = \frac{n!}{(1-1)!\,(n-1-1)!\,(n-n)!}[F(y_1)]^{1-1}[F(y_2) - F(y_1)]^{n-1-1}[1 - F(y_2)]^{n-n} f(y_1)f(y_2)$$

$$= n(n-1)[F(y_2) - F(y_1)]^{n-2} f(y_1)f(y_2). \tag{10.76}$$

Let $u = F(y_1)$, $v = F(y_2)$. Then the joint density of u and v, denoted by g(u,v), is given
by

$$g(u,v) = n(n-1)[v - u]^{n-2},\quad 0 \le u < v \le 1,\ n \ge 2. \tag{10.77}$$

Integrating out over $y_1$ and $y_2$ in (10.76) is equivalent to integrating out u and v in
(10.77). Let us compute the total integral, denoted by q.

$$q = \int\!\!\int g(u,v)\,du\wedge dv = n(n-1)\int_{v=0}^{1}\Big[\int_{u=0}^{v}(v-u)^{n-2}\,du\Big]dv.$$

Put $z = u/v$ and take out v. Then

$$q = n(n-1)\int_{v=0}^{1} v^{n-1}\Big[\int_0^1 (1-z)^{n-2}\,dz\Big]dv = \frac{n(n-1)}{n-1}\int_0^1 v^{n-1}\,dv = 1 \quad\text{for } n \ge 2.$$

Hence $f(y_1,y_2)$ is a density since it is non-negative with total integral unity. When the
population is uniform over [0,1] then the joint density of $y_1$ and $y_2$ is that in (10.77)
with $y_1 = u$ and $y_2 = v$. Let $w = y_2 - y_1$ and $y = y_2$. Then the Jacobian is 1 and the joint
density of w and y, denoted by $g_1(w,y)$, is given by

$$g_1(w,y) = n(n-1)w^{n-2},\quad 0 \le w \le y \le 1$$

and zero elsewhere. Integrating out y, $w \le y \le 1$, the marginal density of w, denoted
by h(w), is the following:

$$h(w) = n(n-1)w^{n-2}\int_w^1 dy = n(n-1)w^{n-2}(1-w),\quad 0 \le w \le 1$$

and zero elsewhere. It is a type-1 beta density with parameters (n - 1, 2).
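The density of the range can be checked by simulation. Integrating $h(t) = n(n-1)t^{n-2}(1-t)$ from 0 to w gives $n w^{n-1} - (n-1)w^n$, and the sketch below compares this with the relative frequency of simulated ranges not exceeding w. The values of n and w, the seed and the number of repetitions are arbitrary choices for the illustration.

```python
import random


def range_cdf(n, w):
    """Pr{range <= w} for a uniform [0,1] sample of size n:
    the integral of n(n-1) t^(n-2) (1-t) over [0, w]."""
    return n * w ** (n - 1) - (n - 1) * w ** n


n, w = 6, 0.5              # illustrative values
rng = random.Random(314)   # fixed seed for reproducibility
reps = 20000

count = 0
for _ in range(reps):
    sample = [rng.random() for _ in range(n)]
    if max(sample) - min(sample) <= w:
        count += 1

empirical = count / reps
theoretical = range_cdf(n, w)
print(empirical, theoretical, range_cdf(n, 1.0))
```

Evaluating `range_cdf` at w = 1 also confirms that h(w) has total integral one.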


One can make another interesting observation. Consider a simple random sample
$x_1,\ldots,x_n$ from a population with density f(x). Suppose that we consider the joint density
of all order statistics $y_1 = x_{n:1}$, $y_2 = x_{n:2}$, ..., $y_n = x_{n:n}$. If all variables are involved,
then the collection of the original variables $\{x_1,\ldots,x_n\}$ and the collection of all order
statistics $\{y_1 = x_{n:1},\ldots,y_n = x_{n:n}\}$ are one and the same. That is, $\{x_1,\ldots,x_n\} = \{y_1,\ldots,y_n\}$.
The only difference is that in the set $\{x_1,\ldots,x_n\}$ the variables are free to vary. But in the
set $\{y_1,\ldots,y_n\}$ the variables are ordered, that is, $y_1 \le y_2 \le \cdots \le y_n$. Given a set of variables
$x_1,\ldots,x_n$, how many such ordered sets are possible? This number is the number
of permutations, which is n!. Hence n! ordered sets are possible. If integration is to
be done for computing some probabilities, then in the set $\{y_1 = x_{n:1},\ldots,y_n = x_{n:n}\}$ the
integration is to be done as follows: If the original variables have non-zero density in
[a, b], then $y_1$ goes from a to $y_2$. Then $y_2$ goes from a to $y_3$ and so on, or if we are integrating
from the other end then $y_n$ goes from $y_{n-1}$ to b, $y_{n-1}$ from $y_{n-2}$ to b and so on.
The idea will be clear from the following example.


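Example 10.16. Let $x_1,\ldots,x_n$ be iid as exponential with parameter $\theta$ or with density
function

$$f(x) = \frac{e^{-x/\theta}}{\theta},\quad x \ge 0,\ \theta > 0$$

and zero elsewhere. Compute the joint density of all order statistics $y_1 = x_{n:1}$, ...,
$y_n = x_{n:n}$.

Solution 10.16. The joint density of $x_1,\ldots,x_n$, denoted by $f(x_1,\ldots,x_n)$, is given by

$$f(x_1,\ldots,x_n) = \frac{1}{\theta^n}\exp\Big\{-\frac{1}{\theta}(x_1 + \cdots + x_n)\Big\},\quad 0 \le x_j < \infty,\ \theta > 0.$$

The joint density of $y_1,\ldots,y_n$, denoted by $g(y_1,\ldots,y_n)$, is then

$$g(y_1,\ldots,y_n) = \frac{n!}{\theta^n}\exp\Big\{-\frac{1}{\theta}(y_1 + \cdots + y_n)\Big\},\quad 0 \le y_1 \le y_2 \le \cdots \le y_n < \infty,$$

and $\theta > 0$. Let us verify that it is a density. It is a non-negative function and let us
compute the total integral. For convenience, we can integrate out starting from $y_n$.
Integration over $y_n$ is given by

$$\frac{n!}{\theta^n}\int_{y_n = y_{n-1}}^{\infty}\exp\Big\{-\frac{y_n}{\theta}\Big\}dy_n = \frac{n!}{\theta^{n-1}}\exp\Big\{-\frac{y_{n-1}}{\theta}\Big\}.$$

In the joint density there is already a $y_{n-1}$ sitting in the exponent. Then for the next
integral the coefficient of $y_{n-1}$ is 2. Then the integral over $y_{n-1}$ gives

$$\frac{n!}{\theta^{n-1}}\int_{y_{n-1} = y_{n-2}}^{\infty}\exp\Big\{-\frac{2y_{n-1}}{\theta}\Big\}dy_{n-1} = \frac{n!}{\theta^{n-2}(2)}\exp\Big\{-\frac{2y_{n-2}}{\theta}\Big\}.$$

Proceeding like this the last integral over $y_1$ is the following:

$$\frac{n!}{\theta(2)(3)\cdots(n-1)}\int_0^{\infty}\exp\Big\{-\frac{ny_1}{\theta}\Big\}dy_1 = \frac{n!\,\theta}{\theta(2)(3)\cdots(n)} = 1.$$

This shows that it is a joint density.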


Exercises 10.6 


10.6.1. Let $x_1,\ldots,x_n$ be independently distributed but not identically distributed. Let
the distribution function of $x_j$ be $F_j(x_j)$, $j = 1,\ldots,n$. Consider observations on $x_1,\ldots,x_n$
and ordering them in ascending values. Let $y_1 = x_{n:1}$ be the smallest order statistic and
$y_2 = x_{n:n}$ be the largest order statistic here. By using the ideas in Sections 10.6.1 and
10.6.2 derive the densities of $y_1$ and $y_2$.


10.6.2. Let x have the density $f(x) = \frac{x}{2}$, $0 \le x \le 2$ and f(x) = 0 elsewhere. Consider a
simple random sample of size n from this population. Construct the densities of (1) the
smallest order statistic, (2) the largest order statistic, (3) the 5-th order statistic, $n \ge 5$.


10.6.3. For the problem in Exercise 10.6.2 evaluate the probability that $x_{n:1} \le 0.5$,
$x_{n:n} \ge 1.5$, $x_{n:5} \le 1$ for n = 10.


10.6.4. (1) Compute c so that f(x) = c(1 x, O < x < 1 and f(x) = 0 elsewhere, is a 
density; (2) Repeat Exercise 10.6.2 if the density in (1) here is the density there. 


10.6.5. Repeat Exercise 10.6.3 for the density in Exercise 10.6.4. 


10.6.6. During the raining season of June-July at Palai it is found that the duration 
of a rain shower, t, is exponentially distributed with expected duration 5 minutes, 
time being measured in minutes. If 5 such showers are randomly selected, what is the 
probability that (1) the shortest shower lasted for more than 2 minutes; (2) the longest 
shower lasted for less than 5 minutes; (3) the longest shower lasted for more than 10 
minutes? 




10.6.7. For the Exercise in 10.6.6, what is the probability that the second shortest 
shower lasted for more than 2 minutes? 


10.6.8. Let $x_{n:1}, x_{n:2}, \ldots, x_{n:n}$ be all the order statistics for a simple random sample of
size n coming from a population with density function f(x) and distribution function
F(x). Construct the joint density of all these order statistics $x_{n:1},\ldots,x_{n:n}$.


10.6.9. The annual family income x of households in a city is found to be distributed 
according to the density f(x) = 5, 1< x < 10, x being measured in Rs10000 units 
which means, for example, that x = 2 stands for Rs 20 000. (1) Compute c; (2) If a simple
random sample of 6 households is taken, what is the probability that the largest of
the household incomes is more than Rs 80 000, that is, $x_{n:n} > 8$?


10.6.10. If $x_1,\ldots,x_n$ are iid Poisson distributed with parameter $\lambda = 2$, construct the
probability function of (1) the smallest order statistic, (2) the largest order statistic. 


10.6.11. Let $x_1,\ldots,x_n$ be iid as uniform over [0,6]. Compute the joint density of the
order statistics $y_1 = x_{n:1},\ldots,y_n = x_{n:n}$ and verify that it is a density.


11 Estimation 


11.1 Introduction 


Statistical inference consists mainly of estimation, testing of statistical hypotheses
and model building. In Chapter 10, we developed some tools which will help in
statistical inference problems. Inference about a statistical population is usually made by
observing a representative sample from that population and then making some deci- 
sions based on some statistical procedures. Inference may be of the following nature: 
Suppose that a farmer is planting corn by preparing the soil as suggested by the lo- 
cal agricultural scientist. The item of interest to the farmer is whether the yield per 
plot of land is going to be bigger than the usual yield that the farmer is getting by the 
traditional method of planting. Then the hypothesis is that the expected yield under 
the new method is greater than or equal to the expected yield under the traditional 
method of planting. This will be something like a hypothesis $E(x_1) \ge E(x_2)$ where $x_1$
is the yield under the new method and $x_2$ is the yield under the traditional method.
The variables $x_1$ and $x_2$ are the populations here, having their own distributions. If $x_1$
and $x_2$ are independently gamma distributed with the parameters $(\alpha_1,\beta_1)$ and $(\alpha_2,\beta_2)$,
then $E(x_1) = \alpha_1\beta_1$ and $E(x_2) = \alpha_2\beta_2$. Then our hypothesis is that $\alpha_1\beta_1 \ge \alpha_2\beta_2$. How do we
carry out a statistical test of the above hypothesis? We have to conduct experiments 
under the traditional method of planting and under the new method of planting and 
collect a sample of observations under both of these methods. This is the first step. 
Hence the very basic aspect of inference is a sampling procedure, that is, taking a representative
sample which in some way represents the whole population. There are many types of
samples and sampling procedures. We have already discussed one type of sample, in 
Chapter 10, namely a simple random sample. There are other types of samples such as 
multistage samples, stratified samples, proportional samples and so on, and the cor- 
responding sampling plans are there. We will only consider inference based on simple 
random samples here. 

The student may be wondering whether it is essential to go for samples and why 
not look into the whole population itself and then take a decision. Sometimes it is 
possible to check the whole population and then take a decision. Suppose that a firm, 
such as HMT, has produced only 10 printing units. Suppose that the claimed weight 
per unit is 10 tons. It is not difficult to check this claim by weighing all the 10 units. 
Sometimes, even if we have only a finite number of units in a population it may not 
be possible to check each and every unit and come up with a decision. Suppose that a 
car manufacturer has produced 25 expensive cars, such as a new model of Ferrari. The 
manufacturer's claim is that in case of frontal collision the damage to the car will be 
less than 10%. One cannot test this claim by 100% checking or 100% testing because
checking each item involves destroying the car itself. Suppose that the manufacturer
of a new brand of electric bulb says that the expected life time of the new brand is
1000 hours. Here, the life-time x may be exponentially distributed with mean value
$\theta$. Then the claim is that $\theta = 1000$. One cannot do a hundred percent testing of this
claim because each observation can be made only by burning the bulb until it is burnt
out. There will not be anything left for sale. Hence one has to go for a representative
sample, study the sample and test the claims by using the sample observations only.

Open Access. © 2018 Arak M. Mathai, Hans J. Haubold, published by De Gruyter. This work is licensed
under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
https://doi.org/10.1515/9783110562545-011

As another example, suppose that you want to test the claim that in the local 
Meenachil River the chance (probability) of flood (water level going above a thresh- 
old value) in any given year is less than or equal to 0.2. Here, even if you wish to do a 
100% checking it is not possible because you do not have all the past records and you
do not have access to the future data on flood. Hence one has to go for some sort of 
a sampling scheme and collect some data and based on this data make inference or 
decisions. Sometimes it may not be worth going for a 100% survey. Suppose that the 
claim is that the average annual income per household in Kerala is less than or equal 
to Rs 10 000. Suppose that there are about one lakh households. It is not wise of you 
to go for a 100% survey to collect this single item about the expected annual income
because it is not worth the money spent. Thus, even if it is possible to conduct a 100%
survey, it may be unwise and may not be worth it to do so. 

The first topic in statistical inference that we are going to consider is the statistical
estimation problem. The idea is to use a representative sample from a given population
and then estimate some aspects of the population. If the population under consideration
is waiting time, t, at a given bus stop and if t is exponentially distributed with
expected waiting time some unknown quantity $\theta$, then our aim here may be to come up
with an estimate of the expected waiting time, that is, an estimate of this
unknown quantity $\theta$. Here, we want to estimate a parameter in a specified population.
Suppose that in the same situation of waiting time being exponentially distributed,
someone is interested in estimating the probability that the waiting time there is greater
than or equal to 5 minutes, time being measured in minutes. Then this probability, p,
is given by the following:

$$p = \Pr\{t \ge 5\} = \int_5^{\infty}\frac{1}{\theta}e^{-t/\theta}\,dt = e^{-5/\theta} = p(\theta).$$

Here, this is a function of the unknown parameter $\theta$ and hence we can look upon this
problem as the problem of estimating a probability or estimating a function of a given
parameter.
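To make the idea of estimating a function of a parameter concrete, the following sketch simulates waiting times, estimates $\theta$ by the sample mean, and substitutes that estimate into $p(\theta) = e^{-5/\theta}$. The use of the sample mean and of direct substitution are assumptions of this illustration, as are the "true" value of $\theta$ used to generate the data, the sample size and the seed; the text has not yet introduced any particular estimation method.

```python
import math
import random


def prob_wait_at_least(theta, t0=5.0):
    """p(theta) = Pr{t >= t0} = exp(-t0/theta) for an exponential waiting time."""
    return math.exp(-t0 / theta)


rng = random.Random(11)  # fixed seed for reproducibility
true_theta = 8.0         # assumed only in order to generate artificial data
waits = [rng.expovariate(1.0 / true_theta) for _ in range(5000)]

theta_hat = sum(waits) / len(waits)      # estimate of the expected waiting time
p_hat = prob_wait_at_least(theta_hat)    # estimate of the parametric function p(theta)
freq = sum(1 for t in waits if t >= 5.0) / len(waits)  # relative frequency, for comparison

print(theta_hat, p_hat, freq, prob_wait_at_least(true_theta))
```

Both the substituted value and the relative frequency should be close to the value of $p(\theta)$ at the $\theta$ used to generate the data.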

As another example, we can look at the growth of a child, measured in terms of 
height. The growth pattern may be more or less a straight line model until the age of 
10 to 11. Then all of a sudden, the child shoots up and attains the maximum height by 
11 or 12. There is a change-point in the growth pattern, somewhere between 10 and 11 
years of age. One may be interested to estimate this change point. As another example, 
consider x, the amount of gold in every ton of gold bearing rock in a mountain range. 
The density function of x itself is unknown and we would like to estimate the density
function itself. Such an estimation problem is called "density estimation". Thus,
so far, we have considered the problem of estimating a parameter in a well-defined
distribution, estimation of a certain probability, estimation of a parametric function, 
estimation of a density itself and the estimation of a change-point in a model building 
situation. Then there are problems where one may be interested in testing whether 
two variables x and y are statistically independently distributed (this is called testing 
independence), one may want to check and see whether the occurrence of a defec- 
tive item in a production process is a random phenomenon or occurring according to 
some specific pattern (this is called testing for randomness), one may want to measure 
the association or affinity between two variables, etc. All these problems involve some 
estimation of some characteristics before a testing procedure can be devised. 


11.2 Parametric estimation 


Out of the various situations that we considered above, one situation is that we have 
a well-defined population or populations and well-defined parameter or parameters 
therein. The estimation process involves estimating either such a parameter or param- 
eters or a parametric function. 


Definition 11.1 (Parametric estimation). Estimation of a parameter or parameters 
or functions of parameters, coming from well-defined populations, is known as a 
parametric estimation problem. 


Even here, there are different situations. We may want to estimate the expected
life-time of a machine part, where the life-time may be exponentially distributed with
expected value $\theta$ hours. In this case, we want to estimate this $\theta$. Theoretically this
expected life time could be any positive real number, or in this case $0 < \theta < \infty$.


Definition 11.2 (Parameter space). The set of all possible values a well-defined parameter
or parameters can take is called the parameter space, usually denoted by $\Omega$.


For example, in an exponential density there is usually one parameter $\theta$ where
$0 < \theta < \infty$. Here, the parameter space is $\Omega = \{\theta \mid 0 < \theta < \infty\}$. In a normal population
$N(\mu,\sigma^2)$, there are two parameters $\mu$ and $\sigma^2$ and here the parameter space is
$\Omega = \{(\mu,\sigma^2) \mid -\infty < \mu < \infty,\ 0 < \sigma^2 < \infty\}$. Consider a gamma density with shape
parameter $\alpha$, scale parameter $\beta$ and location parameter $\gamma$. Then the parameter space is given by
$\Omega = \{(\alpha,\beta,\gamma) \mid 0 < \alpha < \infty,\ 0 < \beta < \infty,\ -\infty < \gamma < \infty\}$. Thus the set of all possible values the
parameters can take will constitute the parameter space, whatever be the number of parameters
involved.

You see advertisements by commercial outfits, such as a toothpaste manufacturer
claiming that their toothpaste will reduce cavities (whatever that may be) by 21 to
46%. If p is the true percentage reduction, the manufacturer is giving an interval, saying
that this true unknown quantity is somewhere on the interval [0.21, 0.46]. A single
number is not given but an interval is given. A saree shop may be advertising that
in that shop the customer will save between Rs 500 and Rs 1000 on the average
per saree, compared to other shops. If the expected saving is $\theta$, then the claim is that
this unknown $\theta$ is somewhere on the interval [500,1000]. An investment company
may be advertising that the expected return in this company will be 10% for any money
invested with that firm. If the true percentage return is p, then the estimate given by
that firm is that $\hat{p} = 0.10$, where $\hat{p}$ is the estimated value of p. A frequent traveler is
claiming that the expected waiting time at a particular bus stop for a particular bus is
10 minutes. If $\theta$ is the unknown expected waiting time, then the traveler is claiming
that an estimate for this $\theta$ is 10 or $\hat{\theta} = 10$.

We have looked into several situations where sometimes single points are
given as estimates for a given parameter and sometimes an interval is given, claiming
that the unknown parameter is somewhere on this interval. The first type of param- 
eter estimates, where single points are given as estimates for given parameters, are 
called point estimates and the procedure is called point estimation procedure. When 
an interval is given saying that the unknown parameter is somewhere on that interval, 
then such an estimate will be called an interval estimate and the procedure is called 
an interval estimation procedure or the procedure for setting up confidence intervals. 
In the case of interval estimation problem if two parameters are involved, then a re- 
gion will be given saying that the two-dimensional point is somewhere on this region. 
Similarly, if k parameters are involved, then k-dimensional Euclidean region will be 
given, which will be then called a confidence region for k = 2,3, .... These will be de- 
fined properly in the next chapter. 

We will start with point estimation procedure in this chapter. But before dis- 
cussing point estimation procedures we will look into some desirable properties that 
we would like to have for point estimates. We have already defined what is meant by 
a statistic in Chapter 10. 


Definition 11.3 (Point estimators). If a statistic is used to estimate a parameter or 
a parametric function then that statistic is called a point estimator for that para- 
metric function and a specific value assumed by that estimator is called a point 
estimate. An estimator is a random variable and an estimate is a value assumed by 
that random variable. 


A desirable property of a point estimator is the property known as unbiasedness. 
The property comes from the desire of having the estimator coinciding with the para- 
metric function, on the average, in the long run. For example, if someone is throwing 
a dart at a small circle on a dart board then an estimate of the probability of hit is 
available from the relative frequency of hit. Repeat the experiment 100 times and if 45
hits are there then the relative frequency of hit or the average is $\frac{45}{100} = 0.45$. Repeat the
experiment another 100 times. If 52 hits are there, then an estimate of the true probability
of hit p is the estimate $\hat{p} = \frac{52}{100} = 0.52$. We would like to have this average or the
relative frequency agreeing with the true value p in the long run, in the sense that
such batches of 100 throws are repeated infinitely many times.


Definition 11.4 (Unbiasedness of estimators). Let $T = T(x_1,\ldots,x_n)$ be an estimator
of a parametric function $g(\theta)$, where $\theta \in \Omega$, some parameter space. If $E[T] = g(\theta)$ for
all values of $\theta \in \Omega$, then we say that the estimator $T$ is unbiased for $g(\theta)$ or $T$ is an
unbiased estimator for $g(\theta)$.


If $E(T) = g(\theta)$ holds for some values of $\theta$ only, such as $\theta > \theta_0 = 5$, then $T$ is not
unbiased for $g(\theta)$. The condition must hold for each and every value of $\theta$ in the whole
parameter space $\Omega$. In Chapter 10, we have seen that when we have iid variables or a
simple random sample from a population with mean value $\mu$ and variance $\sigma^2$ then the
sample mean $\bar{x}$ is unbiased for the population mean value $\mu$, as long as $\mu$ is finite, and
$\sum_{j=1}^{n}(x_j - \bar{x})^2/(n-1)$ is unbiased for the population variance, as long as $\sigma^2 < \infty$. That
is, for all values of $\mu < \infty$, whatever be the population,

$$E[\bar{x}] = \mu < \infty \tag{11.1}$$

and

$$E\left[\frac{\sum_{j=1}^{n}(x_j - \bar{x})^2}{n-1}\right] = \sigma^2 < \infty \tag{11.2}$$

for $\sigma^2$ finite. These are some general results, irrespective of the populations.
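The results (11.1) and (11.2) can be checked exactly on a small case. The following is a minimal Python sketch, assuming a three-point population with equal probability masses (the values 0, 1, 4 and the name `unbiased_T` are illustrative, not from the text). It enumerates every equally likely iid sample of size 3 and averages the two statistics with exact fractions.

```python
from fractions import Fraction
from itertools import product

# Illustrative population: three equally likely values (an assumption).
population = [Fraction(v) for v in (0, 1, 4)]
n = 3  # sample size

mu = sum(population) / len(population)
sigma2 = sum((v - mu) ** 2 for v in population) / len(population)

def unbiased_T(sample):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    xbar = sum(sample) / len(sample)
    return sum((x - xbar) ** 2 for x in sample) / (len(sample) - 1)

# All iid samples of size n; each has the same probability.
samples = list(product(population, repeat=n))
mean_of_xbar = sum(sum(s) / n for s in samples) / len(samples)
mean_of_T = sum(unbiased_T(s) for s in samples) / len(samples)

print(mean_of_xbar == mu, mean_of_T == sigma2)
```

Because the arithmetic is exact, the two comparisons test equality rather than closeness.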


Remark 11.1. Due to unbiasedness of the statistic $T = \frac{\sum_{j=1}^{n}(x_j - \bar{x})^2}{n-1}$ for the population
variance $\sigma^2$, some people are tempted to define this $T$ as the sample variance. But
this approach is inconsistent with the definition of population variance as $\sigma^2 =
E[x - E(x)]^2$. Take, for example, a discrete distribution where the probability masses
$1/n$ each are distributed at the points $x_1,\ldots,x_n$; then

$$E[x - E(x)]^2 = \frac{\sum_{j=1}^{n}(x_j - \bar{x})^2}{n} \ne \frac{\sum_{j=1}^{n}(x_j - \bar{x})^2}{n-1}. \tag{11.3}$$

Besides, variance is the square of a “distance” measure, measuring per unit scatter
from a point of location, namely, $E(x)$ if we consider the population and $\bar{x}$ if we
consider sample values. Thus the dividing factor should be $n$, not $n - 1$. Also, unbiasedness
may not be a desirable property in many situations. Suppose that someone
wants to cross a river. If he has the depths at every meter across the river, the average
may be half a meter and this information is of no use to cross the river. The deepest
point may be 5 meters deep.
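The two dividing factors discussed in Remark 11.1 are easy to compare on a small data set. The sketch below is only an illustration: the four data values are an assumption chosen for the example, and the variable names are hypothetical. It also cross-checks against the Python standard library, where `statistics.pvariance` divides by $n$ and `statistics.variance` divides by $n - 1$.

```python
import statistics

sample = [5, 1, 0, 2]  # illustrative data values
n = len(sample)
xbar = sum(sample) / n
ss = sum((x - xbar) ** 2 for x in sample)  # sum of squared deviations

var_n = ss / n                 # dividing factor n, as argued for in Remark 11.1
var_n_minus_1 = ss / (n - 1)   # dividing factor n - 1, the unbiased statistic T

print(var_n, statistics.pvariance(sample))
print(var_n_minus_1, statistics.variance(sample))
```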




Example 11.1. Compute the expected value of (1): $(-2)^x$ when $x$ is a Poisson random
variable with parameter $\lambda > 0$; (2): $(-1)^x$ when $x$ is a binomial random variable with
parameters $(n,p)$, $0 < p < 1$, $q = 1 - p$, $p < \frac{1}{2}$, and construct unbiased estimators for $e^{-3\lambda}$
and $(1 - 2p)^n$, respectively, and make comments.


Solution 11.1. (1) For the Poisson case the probability function is

$$f_1(x) = \frac{\lambda^x}{x!}e^{-\lambda}, \quad \lambda > 0, \ x = 0,1,2,\ldots$$

and $f_1(x) = 0$ elsewhere. Then the expected value of $(-2)^x$ is given by the following:

$$E[(-2)^x] = \sum_{x=0}^{\infty}(-2)^x\frac{\lambda^x}{x!}e^{-\lambda} = e^{-\lambda}\sum_{x=0}^{\infty}\frac{(-2\lambda)^x}{x!} = e^{-\lambda}e^{-2\lambda} = e^{-3\lambda}, \quad 0 < e^{-3\lambda} < 1.$$

Thus the statistic $(-2)^x$ is unbiased for $e^{-3\lambda}$ in this case. Note that for $\lambda > 0$, $0 < e^{-3\lambda} < 1$
whereas $(-2)^x$ is 1, or greater than 1, or a negative quantity. Hence this statistic $T(x) = (-2)^x$
is a meaningless estimator for $e^{-3\lambda}$, even though it is an unbiased estimator for $e^{-3\lambda}$.

(2) For the binomial case, the probability function is given by

$$f_2(x) = \binom{n}{x}p^xq^{n-x}, \quad 0 < p < 1, \ q = 1 - p, \ x = 0,1,\ldots,n$$

and $f_2(x) = 0$ elsewhere. Then the expected value of $(-1)^x$ is given by the following:

$$E[(-1)^x] = \sum_{x=0}^{n}(-1)^x\binom{n}{x}p^xq^{n-x} = (q - p)^n = (1 - 2p)^n.$$

But for $p < \frac{1}{2}$, $(1 - 2p)^n > 0$ and not equal to $\pm 1$. The estimator is $(-1)^x = 1$ or $-1$, and
hence this unbiased estimator is a meaningless estimator for the parametric function
$(1 - 2p)^n$.
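The two expected values in Solution 11.1 can be checked numerically. The following is a minimal Python sketch, assuming that the Poisson series may be truncated after a finite number of terms; the function names and the parameter values are illustrative and not from the text.

```python
import math

def expected_neg2_pow_poisson(lam, terms=200):
    """Approximate E[(-2)^x] for x ~ Poisson(lam) by a truncated series."""
    total = 0.0
    term = math.exp(-lam)  # x = 0 term of sum (-2 lam)^x e^{-lam} / x!
    for x in range(terms):
        total += term
        term *= (-2.0 * lam) / (x + 1)  # ratio of consecutive terms
    return total

def expected_neg1_pow_binomial(n, p):
    """E[(-1)^x] for x ~ Binomial(n, p), summed over the whole support."""
    q = 1.0 - p
    return sum((-1) ** x * math.comb(n, x) * p ** x * q ** (n - x)
               for x in range(n + 1))

lam, n, p = 0.7, 5, 0.3  # illustrative parameter values
print(expected_neg2_pow_poisson(lam), math.exp(-3 * lam))
print(expected_neg1_pow_binomial(n, p), (1 - 2 * p) ** n)
```

Each printed pair should agree up to rounding, even though no single observed value of $(-2)^x$ or $(-1)^x$ lies near the quantity being estimated.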


Remark 11.2. Many such unbiased estimators for parametric functions can be constructed
as in Example 11.1. Hence one should not take unbiasedness as a universally
very desirable property for estimators.


Definition 11.5 (Estimable function). Let $g(\theta)$ be a parametric function. If there exists
at least one unbiased estimator for $g(\theta)$, then we say that $g(\theta)$ is estimable. If
there is no $T = T(x_1,\ldots,x_n)$ such that $E(T) = g(\theta)$ for all $\theta$, then we say that $g(\theta)$ is
not estimable.


Suppose that we are constructing an estimator for $E(x_j) = \mu$ as a linear function of
the iid variables $x_1,\ldots,x_n$; then any estimator in this class of linear functions will be
of the form $u = a_1x_1 + \cdots + a_nx_n$. Then unbiased estimation of $\mu$ in the class of linear functions
will imply that

$$E(u) = \mu \ \Rightarrow \ a_1\mu + \cdots + a_n\mu = \mu \ \Rightarrow \ a_1 + \cdots + a_n = 1.$$

That is, any unbiased estimator for $\mu$ in this linear class must satisfy the condition
$a_1 + \cdots + a_n = 1$, which will then be the estimability condition here.


Example 11.2. Construct an unbiased estimator for (1) $p^2$ where $p$ is the expected binomial
proportion; (2) $\lambda^2$ where $\lambda$ is the mean value in a Poisson population; (3) $\theta^2$
where $\theta$ is the mean value in an exponential population.


Solution 11.2. (1) Let $x$ be binomially distributed with parameters $(n,p)$, $0 < p < 1$.
Then we know that

$$E[x(x-1)] = n(n-1)p^2 \ \Rightarrow \ E[u] = E\left[\frac{x(x-1)}{n(n-1)}\right] = p^2.$$

Hence $u = \frac{x(x-1)}{n(n-1)}$ here is an unbiased estimator for $p^2$.

(2) Let $x$ be a Poisson random variable with parameter $\lambda$. Then we know that

$$E[x(x-1)] = \lambda^2 \ \Rightarrow \ u = x(x-1)$$

is unbiased for $\lambda^2$ when $x$ is a Poisson variable.

(3) Let $x$ be an exponential random variable with mean value $\theta$. Then we know
that the population variance is $\theta^2$. Let $x_1,\ldots,x_n$ be iid variables distributed as the exponential
variable $x$. For any population with finite variance $\sigma^2$, we know that

$$E\left[\frac{\sum_{j=1}^{n}(x_j - \bar{x})^2}{n-1}\right] = \sigma^2.$$

Hence $u = \frac{\sum_{j=1}^{n}(x_j - \bar{x})^2}{n-1}$ is unbiased for $\theta^2$.
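Parts (1) and (2) of Solution 11.2 can be verified by summing over the probability function. The sketch below is an illustration under stated assumptions: the binomial check uses exact fractions with an arbitrarily chosen $n = 6$, $p = 1/3$, and the Poisson check truncates the infinite series, so it is only approximate. The function names are hypothetical.

```python
from fractions import Fraction
import math

def mean_of_binomial_estimator(n, p):
    """Exact E[x(x-1)/(n(n-1))] for x ~ Binomial(n, p), with p a Fraction."""
    q = 1 - p
    return sum(Fraction(x * (x - 1), n * (n - 1))
               * math.comb(n, x) * p ** x * q ** (n - x)
               for x in range(n + 1))

def mean_of_poisson_estimator(lam, terms=200):
    """Approximate E[x(x-1)] for x ~ Poisson(lam) by a truncated series."""
    total, pmf = 0.0, math.exp(-lam)  # pmf starts at Pr{x = 0}
    for x in range(terms):
        total += x * (x - 1) * pmf
        pmf *= lam / (x + 1)  # Pr{x + 1} from Pr{x}
    return total

p = Fraction(1, 3)
print(mean_of_binomial_estimator(6, p), p ** 2)
print(mean_of_poisson_estimator(2.5), 2.5 ** 2)
```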

Definition 11.6 (Bias in an estimator). Let $T = T(x_1,\ldots,x_n)$ be an estimator for a
parametric function $g(\theta)$. Let the expected value of $T$ be $g_1(\theta)$. If $T$ is unbiased,
then $g_1(\theta) = g(\theta)$. Otherwise $b(\theta) = g_1(\theta) - g(\theta)$ is called the bias in this estimator
for estimating $g(\theta)$.


Another desirable property can be defined in terms of a measure of distance. Let
$x_1,\ldots,x_n$ be a simple random sample from a population with parameter $\theta$ and let $g(\theta)$
be a function of $\theta$ to be estimated. Let $T_1 = T_1(x_1,\ldots,x_n)$ and $T_2 = T_2(x_1,\ldots,x_n)$ be two
estimators for $g(\theta)$. A measure of distance between $T_1$ and $g(\theta)$ and that between $T_2$
and $g(\theta)$ are the following:

$$\left\{E\left[|T_i(x_1,\ldots,x_n) - g(\theta)|^r\right]\right\}^{\frac{1}{r}}, \quad r \ge 1, \ i = 1,2. \tag{11.4}$$

If the distance between an estimator $T$ and a parametric function $g(\theta)$ is small, then $T$
is close to $g(\theta)$ or we say that it is a good estimator, or the smaller the distance the more efficient
the estimator. If the distance is zero, then that is the best estimator but distance is zero
only for deterministic situations or for degenerate random variables. Hence we will
adopt the terminology “smaller the distance better the estimator”. A criterion called
relative efficiency is defined in terms of the above distance measure in (11.4) for $r = 2$
and by using the square of the distance measure when $r = 2$. It is only for convenience
that $r = 2$ is taken.


Definition 11.7 (Relative efficiency of estimators). Let $T_1$ and $T_2$ be estimators for
the same parametric function $g(\theta)$. If $E[T_1 - g(\theta)]^2 < E[T_2 - g(\theta)]^2$ for all $\theta$, then we
say that $T_1$ is relatively more efficient than $T_2$ for estimating $g(\theta)$.


Remark 11.3. If $T_1$ and $T_2$ are unbiased for $g(\theta)$ also, then $E[T_i - g(\theta)]^2 = \text{Var}(T_i)$
and then the criterion becomes the following: If $\text{Var}(T_1) < \text{Var}(T_2)$, then we say that
$T_1$ is relatively more efficient than $T_2$ in estimating $g(\theta)$.
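
The connection between the bias of Definition 11.6 and the squared distance of Definition 11.7 can be written out in one line. The derivation below is a standard identity that is not stated in the text above; it writes $g_1(\theta) = E(T)$ and $b(\theta) = g_1(\theta) - g(\theta)$ as in Definition 11.6 and assumes that $T$ has a finite variance.

```latex
\begin{aligned}
E[T - g(\theta)]^2
  &= E\big[(T - g_1(\theta)) + (g_1(\theta) - g(\theta))\big]^2 \\
  &= E[T - g_1(\theta)]^2 + 2\,b(\theta)\,E[T - g_1(\theta)] + [b(\theta)]^2 \\
  &= \operatorname{Var}(T) + [b(\theta)]^2 .
\end{aligned}
```

The middle term vanishes because $E[T - g_1(\theta)] = 0$. When $b(\theta) = 0$, the squared distance reduces to the variance, which is the criterion in Remark 11.3.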


Example 11.3. Let $x_1,\ldots,x_n$ be iid variables with $E(x_j) = \mu$ and $\text{Var}(x_j) = \sigma^2 < \infty$. Consider
the estimates (a) $u_1 = 2x_1 - x_2$, $u_2 = \frac{1}{2}(x_1 + x_2)$; (b) $u_1 = x_1 + x_2$, $u_2 = x_1 - x_2$; (c) $u_1 =
2x_1 - x_2$, $u_2 = x_1 + x_2$. Which is more efficient in estimating $\mu$?


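Solution 11.3. (a) Let us compute the squared distances from $\mu$. Let the distances be
denoted by $d_1$ and $d_2$, respectively. Then

$$\begin{aligned}
d_1^2 = E[u_1 - \mu]^2 &= E[2x_1 - x_2 - \mu]^2 = E[2(x_1 - \mu) - (x_2 - \mu)]^2 \\
&= E[4(x_1 - \mu)^2 + (x_2 - \mu)^2 - 4(x_1 - \mu)(x_2 - \mu)] \\
&= 4\sigma^2 + \sigma^2 - 0 = 5\sigma^2
\end{aligned}$$

since the covariance is zero. Similarly,

$$\begin{aligned}
d_2^2 = E[u_2 - \mu]^2 &= E\left[\frac{1}{2}(x_1 + x_2) - \mu\right]^2 = \frac{1}{4}E[(x_1 - \mu) + (x_2 - \mu)]^2 \\
&= \frac{1}{4}E[(x_1 - \mu)^2 + (x_2 - \mu)^2 + 2(x_1 - \mu)(x_2 - \mu)] \\
&= \frac{1}{4}[\sigma^2 + \sigma^2 + 0] = \frac{\sigma^2}{2}.
\end{aligned}$$

Hence $u_2$ is relatively more efficient than $u_1$ in this case. Note that both $u_1$ and $u_2$ are
unbiased for $\mu$ also in this case.

(b)

$$\begin{aligned}
d_1^2 = E[u_1 - \mu]^2 &= E[x_1 + x_2 - \mu]^2 = E[(x_1 - \mu) + (x_2 - \mu) + \mu]^2 \\
&= E[(x_1 - \mu)^2] + E[(x_2 - \mu)^2] + \mu^2 + 2E[(x_1 - \mu)(x_2 - \mu)] \\
&\quad + 2\mu E[(x_1 - \mu)] + 2\mu E[(x_2 - \mu)] \\
&= \sigma^2 + \sigma^2 + \mu^2 + 0 = 2\sigma^2 + \mu^2,
\end{aligned}$$

$$\begin{aligned}
d_2^2 = E[u_2 - \mu]^2 &= E[x_1 - x_2 - \mu]^2 = E[(x_1 - \mu) - (x_2 - \mu) - \mu]^2 \\
&= \sigma^2 + \sigma^2 + \mu^2 - 0 = 2\sigma^2 + \mu^2.
\end{aligned}$$

Here, both $u_1$ and $u_2$ are equally efficient in estimating $\mu$ and neither of them is
unbiased.

(c)

$$d_1^2 = E[u_1 - \mu]^2 = E[2x_1 - x_2 - \mu]^2 = 5\sigma^2.$$

This is already done above. Note that $u_1$ is unbiased for $\mu$ also here.

$$d_2^2 = E[u_2 - \mu]^2 = E[x_1 + x_2 - \mu]^2 = 2\sigma^2 + \mu^2.$$

This is computed above. Note that $u_2$ is not unbiased for $\mu$. If $5\sigma^2 > 2\sigma^2 + \mu^2$ or $3\sigma^2 > \mu^2$,
then $u_2$, even though not unbiased, is relatively more efficient in estimating $\mu$.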
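All the squared distances in Solution 11.3 are instances of one formula for a linear estimator $u = a_1x_1 + a_2x_2$ of $\mu$: the variance $(a_1^2 + a_2^2)\sigma^2$ plus the squared bias $(a_1 + a_2 - 1)^2\mu^2$. The Python sketch below evaluates that formula; the function name and the numerical values of $\mu$ and $\sigma^2$ are illustrative assumptions, not taken from the text.

```python
def mse_linear(a1, a2, mu, sigma2):
    """E[(a1*x1 + a2*x2 - mu)^2] for iid x1, x2 with mean mu, variance sigma2.

    Uses squared distance = variance + bias^2; the covariance of x1, x2 is zero.
    """
    variance = (a1 ** 2 + a2 ** 2) * sigma2
    bias = (a1 + a2 - 1) * mu
    return variance + bias ** 2

mu, sigma2 = 1.0, 4.0  # illustrative values
print(mse_linear(2, -1, mu, sigma2))     # 2*x1 - x2:    5*sigma2
print(mse_linear(0.5, 0.5, mu, sigma2))  # (x1 + x2)/2:  sigma2/2
print(mse_linear(1, 1, mu, sigma2))      # x1 + x2:      2*sigma2 + mu^2
print(mse_linear(1, -1, mu, sigma2))     # x1 - x2:      2*sigma2 + mu^2
```

With these values $3\sigma^2 = 12 > \mu^2 = 1$, so in case (c) the biased estimator $x_1 + x_2$ has the smaller squared distance, in line with the condition derived in the solution.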

The smaller the distance of an estimator $T$ from a parametric function $g(\theta)$, the better
the estimator $T$ for estimating $g(\theta)$. If $T$ is unbiased also then it is equivalent to
saying: the smaller the variance the better the estimator. Now, if we consider the class of unbiased
estimators then there may be an estimator having the least variance. Then that
estimator will be a very good estimator. Then, combining the two properties of unbiasedness
and relative efficiency one can come up with a desirable estimator called
minimum variance unbiased estimator or MVUE. We will discuss this aspect later on.

Observe that all the desirable properties of point estimators that we have described
so far, namely unbiasedness and relative efficiency, are fixed sample size
properties, that is, the sample size $n$ remains the same. But we may want to look at estimators
if we take more and more sample values or when the sample size $n$ goes
to infinity. Then such a property can be called a “large sample property”, compared
to fixed sample size or “small sample properties”. If we are estimating a parametric
function $g(\theta)$ and if $T = T(x_1,\ldots,x_n)$ is an estimator for $g(\theta)$, then it is desirable to
have the probability of $|T - g(\theta)| < \epsilon$, where $\epsilon$ is a very small positive quantity,
equal to one or nearly one for every sample size $n$, that is, if we have one variable $x_1$ or two
variables $x_1, x_2$, etc. But this may not be possible. Then the next best thing is to have
this property holding when $n \to \infty$. This means that the probability that $T$ is within $\epsilon$ of $g(\theta)$
goes to one when $n$ goes to infinity or

$$\Pr\{|T - g(\theta)| < \epsilon\} \to 1 \text{ as } n \to \infty \quad \text{or} \quad \lim_{n\to\infty}\Pr\{|T - g(\theta)| < \epsilon\} = 1.$$


Definition 11.8 (Consistency of estimators). Let $T_n = T(x_1,\ldots,x_n)$ be an estimator
for a parametric function $g(\theta)$. Then if $\lim_{n\to\infty}\Pr\{|T_n - g(\theta)| \ge \epsilon\} = 0$ for every $\epsilon > 0$
or $\Pr\{T_n \to g(\theta)\} \to 1$ as $n \to \infty$ or $\lim_{n\to\infty}\Pr\{|T_n - g(\theta)| < \epsilon\} = 1$ for every $\epsilon > 0$ however
small it may be, or $T_n$ converges to $g(\theta)$ in probability, then $T_n$ is called a consistent
estimator for $g(\theta)$ and the property is called consistency of estimators.


Hence one can say that consistency is nothing but stochastic convergence of an
estimator to the parametric function to be estimated, when the sample size $n$ goes to
infinity. From Chebyshev's inequality of Chapter 9, we have the following result: Let
$T_n = T(x_1,\ldots,x_n)$ be an estimator for $g(\theta)$. Let $E(T_n) = g_1(\theta)$ if it is not unbiased for $g(\theta)$.
Then from Chebyshev's inequality,

$$\Pr\{|T_n - E(T_n)| \ge k\} \le \frac{\text{Var}(T_n)}{k^2}. \tag{11.5}$$

Then we have the following possibilities. Either $E[T_n] = g_1(\theta) = g(\theta)$ and $\text{Var}(T_n) \to 0$
as $n \to \infty$ or $E(T_n) = g_1(\theta) \to g(\theta)$ and $\text{Var}(T_n) \to 0$ as $n \to \infty$. In both of these situations,
$T_n$ converges to $g(\theta)$ in probability. Hence a practical way of checking for consistency
of an estimator is the following: Check for

$$E[T(x_1,\ldots,x_n)] \to g(\theta) \text{ as } n \to \infty \tag{i}$$

$$\text{Var}(T_n) \to 0 \text{ as } n \to \infty \tag{ii}$$

then $T_n$ is consistent for $g(\theta)$.


Example 11.4. Check for the consistency of (1) sample mean as an estimator for the
population mean value $\mu$ for any population with finite variance $\sigma^2 < \infty$, (2) sample
variance as an estimator for the population variance when the population is $N(\mu,\sigma^2)$.


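Solution 11.4. (1) Let $x_1,\ldots,x_n$ be iid variables with $E(x_j) = \mu$ and $\text{Var}(x_j) = \sigma^2 < \infty$.
Let the sample mean $\bar{x} = \frac{1}{n}(x_1 + \cdots + x_n)$. Then we know that $E(\bar{x}) = \mu$ and $\text{Var}(\bar{x}) = \frac{\sigma^2}{n}$.
Then from Chebyshev's inequality,

$$\Pr\{|\bar{x} - \mu| \ge k\} \le \frac{\text{Var}(\bar{x})}{k^2} = \frac{\sigma^2}{nk^2} \to 0 \text{ as } n \to \infty.$$

Therefore, $\bar{x}$ is consistent for $\mu$ whatever be the population as long as $\sigma^2 < \infty$.

(2) The sample variance

$$s^2 = \frac{\sum_{j=1}^{n}(x_j - \bar{x})^2}{n} \quad \text{and} \quad E[s^2] = \frac{(n-1)}{n}\sigma^2 \to \sigma^2 \text{ as } n \to \infty$$

for all populations with finite variance. For computing the variance of $s^2$, we will consider
the case when the population is normal, $N(\mu,\sigma^2)$. For other populations, one has
to work out the variance separately. When the population is normal one can use the
properties of chi-square variables because (see Chapter 10)

$$\sum_{j=1}^{n}\frac{(x_j - \bar{x})^2}{\sigma^2} \sim \chi^2_{n-1}$$

with

$$E[\chi^2_{n-1}] = (n-1) \quad \text{and} \quad \text{Var}(\chi^2_{n-1}) = 2(n-1).$$

Hence

$$\begin{aligned}
\text{Var}(s^2) &= \frac{1}{n^2}\text{Var}\left(\sum_{j=1}^{n}(x_j - \bar{x})^2\right) = \frac{\sigma^4}{n^2}\text{Var}\left(\sum_{j=1}^{n}\frac{(x_j - \bar{x})^2}{\sigma^2}\right) \\
&= \frac{\sigma^4}{n^2}\text{Var}(\chi^2_{n-1}) = \frac{2(n-1)\sigma^4}{n^2} \to 0 \text{ as } n \to \infty.
\end{aligned}$$

$$\begin{aligned}
E(s^2) &= \frac{\sigma^2}{n}E\left[\sum_{j=1}^{n}\frac{(x_j - \bar{x})^2}{\sigma^2}\right] = \frac{\sigma^2}{n}E[\chi^2_{n-1}] \\
&= \frac{\sigma^2(n-1)}{n} \to \sigma^2 \text{ as } n \to \infty.
\end{aligned}$$

Hence from (i) and (ii) above, $s^2$ is consistent for the population variance $\sigma^2$. Also from
(i) and (ii), the following result is obvious.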


Result 11.1. If $T_n$ is a consistent estimator for $g(\theta)$, then $b_nT_n$ is consistent for $bg(\theta)$
when $b_n \to b$ and $b_n^2$ goes to a finite constant when $n \to \infty$.
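
The quantities in Solution 11.4 can be tabulated to see how quickly they shrink. The following is a minimal Python sketch with illustrative values $\sigma^2 = 4$ and $k = 0.5$; the function names are hypothetical. It evaluates the Chebyshev bound $\sigma^2/(nk^2)$ for the sample mean and the mean and variance of $s^2$ for a normal population.

```python
def chebyshev_bound_mean(sigma2, n, k):
    """Upper bound sigma2 / (n k^2) on Pr{|xbar - mu| >= k}."""
    return sigma2 / (n * k ** 2)

def moments_of_s2_normal(sigma2, n):
    """E[s^2] and Var(s^2) for s^2 = sum (x_j - xbar)^2 / n, normal population."""
    mean = (n - 1) * sigma2 / n
    var = 2 * (n - 1) * sigma2 ** 2 / n ** 2
    return mean, var

sigma2, k = 4.0, 0.5  # illustrative values
for n in (10, 100, 1000, 10000):
    print(n, chebyshev_bound_mean(sigma2, n, k), moments_of_s2_normal(sigma2, n))
```

For small $n$ the bound can exceed 1, in which case it carries no information; only its limit as $n \to \infty$ is used in the consistency argument.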


Example 11.5. Consider a uniform population over $[0,\theta]$ and a simple random sample
of size $n$ from this population. Are the largest order statistic $x_{n:n}$ and the smallest order
statistic $x_{n:1}$ consistent for $\theta$?


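Solution 11.5. See Section 10.5 in Chapter 10 for a discussion of order statistics. Let
$y_2 = x_{n:n}$ and $y_1 = x_{n:1}$. The distribution function $F(x)$ of $x$, when $x$ is uniform over $[0,\theta]$,
is given by

$$F(x) = \begin{cases} 0, & -\infty < x < 0 \\ \frac{x}{\theta}, & 0 \le x < \theta \\ 1, & x \ge \theta. \end{cases}$$

The densities for the largest order statistic, denoted by $f_{(n)}(y_2)$, and the smallest order
statistic, denoted by $f_{(1)}(y_1)$, are given by the following:

$$f_{(n)}(y_2) = \frac{d}{dx}[F(x)]^n\Big|_{x=y_2} = \frac{d}{dx}\left[\frac{x}{\theta}\right]^n\Big|_{x=y_2} = \frac{ny_2^{n-1}}{\theta^n}, \quad 0 \le y_2 \le \theta, \ n \ge 1$$

and zero elsewhere, and

$$f_{(1)}(y_1) = -\frac{d}{dx}[1 - F(x)]^n\Big|_{x=y_1} = \frac{n}{\theta}\left[1 - \frac{y_1}{\theta}\right]^{n-1}, \quad 0 \le y_1 \le \theta$$

and zero elsewhere.

$$E[y_2] = \frac{n}{\theta^n}\int_0^{\theta}y_2(y_2)^{n-1}dy_2 = \frac{n}{\theta^n}\frac{\theta^{n+1}}{n+1} = \frac{n}{n+1}\theta \to \theta \text{ as } n \to \infty.$$

$$E[y_2^2] = \frac{n}{\theta^n}\int_0^{\theta}y_2^2(y_2)^{n-1}dy_2 = \frac{n}{\theta^n}\frac{\theta^{n+2}}{n+2} = \frac{n}{n+2}\theta^2.$$

Therefore,

$$\text{Var}(y_2) = E[y_2^2] - [E(y_2)]^2 = \frac{n}{n+2}\theta^2 - \frac{n^2}{(n+1)^2}\theta^2 = \theta^2\left[\frac{n}{n+2} - \frac{n^2}{(n+1)^2}\right] \to 0 \text{ as } n \to \infty$$

since $\lim_{n\to\infty}\frac{n}{n+2} = 1$ and $\lim_{n\to\infty}\frac{n^2}{(n+1)^2} = 1$. Thus, from conditions (i) and (ii) given
after (11.5), the largest order statistic here is consistent for $\theta$. Now, let us examine
$y_1 = x_{n:1}$, the smallest order statistic.

$$\begin{aligned}
E[y_1] &= \frac{n}{\theta}\int_0^{\theta}y_1\left[1 - \frac{y_1}{\theta}\right]^{n-1}dy_1 = n\theta\int_0^1u(1-u)^{n-1}du, \quad u = \frac{y_1}{\theta} \\
&= n\theta\frac{\Gamma(2)\Gamma(n)}{\Gamma(n+2)} = \frac{\theta}{n+1} \to 0 \text{ as } n \to \infty.
\end{aligned}$$

$$\begin{aligned}
E[y_1^2] &= \frac{n}{\theta}\int_0^{\theta}y_1^2\left[1 - \frac{y_1}{\theta}\right]^{n-1}dy_1 = n\theta^2\int_0^1u^2(1-u)^{n-1}du \\
&= n\theta^2\frac{\Gamma(3)\Gamma(n)}{\Gamma(n+3)} = \frac{2n\theta^2}{n(n+1)(n+2)} \to 0 \text{ as } n \to \infty.
\end{aligned}$$

Here, $E(y_1) \to 0$ as well as $\text{Var}(y_1) \to 0$ as $n \to \infty$. Hence $y_1$ is not consistent for $\theta$.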
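
The behaviour found in Solution 11.5 can also be seen by simulation. The sketch below is a rough illustration, assuming $\theta = 3$, 2000 replications and a fixed random seed (all illustrative choices; the function name is hypothetical). It averages the smallest and largest observations of uniform samples and prints them next to the theoretical means $\theta/(n+1)$ and $n\theta/(n+1)$.

```python
import random

def simulate_extremes(theta, n, reps=2000, seed=0):
    """Average the smallest and largest of n uniform[0, theta] draws over reps."""
    rng = random.Random(seed)
    sum_min = sum_max = 0.0
    for _ in range(reps):
        sample = [rng.uniform(0.0, theta) for _ in range(n)]
        sum_min += min(sample)
        sum_max += max(sample)
    return sum_min / reps, sum_max / reps

theta = 3.0  # illustrative parameter value
for n in (5, 50, 200):
    avg_min, avg_max = simulate_extremes(theta, n)
    # Columns: n, simulated and theoretical mean of y_1, then of y_2.
    print(n, avg_min, theta / (n + 1), avg_max, n * theta / (n + 1))
```

As $n$ grows, the averages of the largest observation approach $\theta$ while those of the smallest approach 0, matching the conclusion that only the largest order statistic is consistent for $\theta$.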

Another desirable property of an estimator is known as sufficiency. Let $T =
T(x_1,\ldots,x_n)$ be an estimator for a parameter $\theta$. If the joint density/probability function
of the sample values $x_1,\ldots,x_n$, given the statistic $T$, meaning the conditional
distribution of $x_1,\ldots,x_n$, given the statistic $T$, is free of the parameter $\theta$, then all the
information about $\theta$, that can be obtained from the sample $x_1,\ldots,x_n$, is contained in
$T$ because once $T$ is known the conditional distribution of $x_1,\ldots,x_n$ becomes free of $\theta$.
Once $T$ is known there is no need to know the whole sample space if our aim is to say
something about $\theta$. For example, for $n = 2$, knowing the sample means knowing every
point $(x_1,x_2)$ in 2-space where the joint density/probability function of $(x_1,x_2)$ has the
non-zero form. Let $T = x_1 + x_2$, the sample sum. Then $x_1 + x_2$ = a given quantity such as
$x_1 + x_2 = 2$ is only a line, as shown in Figure 11.1. Knowing $T = x_1 + x_2$ means knowing
only the points on this line. This enables a large reduction of the data points. If $T$
is sufficient for $\theta$, then it means that if our aim is to say something about $\theta$ then we
need to confine our attention only to the line $x_1 + x_2 = 2$ and need not worry about the
whole $(x_1,x_2)$-plane. This is a very strong property.

Figure 11.1: Reduction of data.


Definition 11.9 (Sufficient estimators). An estimator $T = T(x_1,\ldots,x_n)$ is called sufficient
for a parameter $\theta \in \Omega$ if the conditional joint distribution of the sample values
$x_1,\ldots,x_n$, given $T$, is free of $\theta$ for all values of $\theta \in \Omega$, where $\Omega$ is the parameter space.
If $r$ statistics $T_1,\ldots,T_r$ are such that the conditional distribution of $x_1,\ldots,x_n$, given
$T_1,\ldots,T_r$, is free of the parameters $\theta_1,\ldots,\theta_s$, $r \ge s$ or $r < s$, then we say that $\{T_1,\ldots,T_r\}$
are jointly sufficient for $\{\theta_1,\ldots,\theta_s\}$.


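Example 11.6. Let $x_1,\ldots,x_n$ be a simple random sample from $N(\mu,1)$. Let $\bar{x} = \frac{1}{n}(x_1 +
\cdots + x_n)$. Is $\bar{x}$ sufficient for $\mu$?


Solution 11.6. Joint density of $x_1,\ldots,x_n$, denoted by $f(x_1,\ldots,x_n)$, is given by the following
since the variables are iid.

$$f(x_1,\ldots,x_n) = \frac{1}{(\sqrt{2\pi})^n}e^{-\frac{1}{2}\sum_{j=1}^{n}(x_j - \mu)^2}.$$

Note that

$$\sum_{j=1}^{n}(x_j - \mu)^2 = \sum_{j=1}^{n}(x_j - \bar{x})^2 + n(\bar{x} - \mu)^2.$$

When $x_j \sim N(\mu,1)$, we know that $\bar{x} \sim N(\mu,\frac{1}{n})$. The density function of $\bar{x}$, denoted by
$f_1(\bar{x})$, is the following:

$$f_1(\bar{x}) = \frac{\sqrt{n}}{\sqrt{2\pi}}e^{-\frac{n}{2}(\bar{x} - \mu)^2}.$$

The joint density of $x_1,\ldots,x_n$ and $\bar{x}$ is the same as the joint density of $x_1,\ldots,x_n$ because
no independent variable is added by incorporating $\bar{x}$, which is a linear function of the
sample values $x_1,\ldots,x_n$. Hence the conditional density of $x_1,\ldots,x_n$, given $\bar{x}$, denoted
by $g(x_1,\ldots,x_n|\bar{x})$, is given by the following:

$$\begin{aligned}
g(x_1,\ldots,x_n|\bar{x}) &= \frac{\text{joint density of } x_1,\ldots,x_n \text{ and } \bar{x}}{f_1(\bar{x})} \\
&= \frac{\text{joint density of } x_1,\ldots,x_n}{f_1(\bar{x})} = \frac{f(x_1,\ldots,x_n)}{f_1(\bar{x})} \\
&= \frac{(\sqrt{2\pi})^{-n}\exp\{-\frac{1}{2}\sum_{j=1}^{n}(x_j - \bar{x})^2 - \frac{n}{2}(\bar{x} - \mu)^2\}}{\sqrt{n}[(\sqrt{2\pi})^{-1}\exp\{-\frac{n}{2}(\bar{x} - \mu)^2\}]} \\
&= \frac{1}{\sqrt{n}}(\sqrt{2\pi})^{-(n-1)}\exp\left\{-\frac{1}{2}\sum_{j=1}^{n}(x_j - \bar{x})^2\right\}
\end{aligned} \tag{11.6}$$

which is free of $\mu$ and hence $\bar{x}$ is sufficient for $\mu$ here.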


Example 11.7. Let $x_1,\ldots,x_n$ be iid as a uniform over $[0,\theta]$. Let $y_2 = \max\{x_1,\ldots,x_n\}$ and
$y_1 = \min\{x_1,\ldots,x_n\}$ (other notations for largest and smallest order statistics). Is $y_2$ sufficient
for $\theta$?


Solution 11.7. From Section 10.5 of Chapter 10, the density for the largest order statistic
is given by the following:

$$f_{(n)}(y_2) = \frac{ny_2^{n-1}}{\theta^n}, \quad 0 \le y_2 \le \theta, \ n \ge 1.$$

The joint density of $x_1,\ldots,x_n$, which is the same as the joint density of $x_1,\ldots,x_n$ and
$\max\{x_1,\ldots,x_n\} = y_2$, is given by

$$f(x_1,\ldots,x_n) = \frac{1}{\theta^n}, \quad 0 \le x_j \le \theta, \ j = 1,\ldots,n.$$

Hence the conditional density, $g(x_1,\ldots,x_n|T)$, of $x_1,\ldots,x_n$, given $T = y_2$, is the following:

$$g(x_1,\ldots,x_n|T = m) = \frac{f(x_1,\ldots,x_n)}{f_{(n)}(y_2)}\Big|_{y_2=m} = \frac{1}{\theta^n}\cdot\frac{\theta^n}{ny_2^{n-1}}\Big|_{y_2=m} = \frac{1}{nm^{n-1}} \tag{11.7}$$

for $m \ne 0$, which is free of $\theta$. Hence the largest order statistic, namely, $x_{n:n}$ or
$\max\{x_1,\ldots,x_n\}$, is sufficient for $\theta$ here.

Observe that if $T$ is given, that is, $T = c$ where $c$ is a given constant, then $c_1T = c_1c$
is also given for $c_1 \ne 0$ because $c_1T = c_1c$ = a constant. Also if we take a one to one
function of $T$, say, $h(T)$, $T \leftrightarrow h(T)$, then when $T$ is given $h(T)$ is also given and vice
versa. This shows that if $T$ is sufficient for $\theta$ then $h(T)$ is also sufficient for $\theta$.


Result 11.2. If $T$ is a sufficient estimator for $\theta$ and if $T$ to $h(T)$ is a one to one function,
then $h(T)$ is also sufficient for $\theta$.


For example, consider Example 11.6. Let $h(T) = 2\bar{x} - 5$ = given = 7, say, then this
means that $\bar{x} = \frac{7+5}{2} = 6$ = given or vice versa.


Example 11.8. Let $x_1,\ldots,x_n$ be iid as a Poisson with probability function

$$f(x) = \frac{\lambda^x}{x!}e^{-\lambda}, \quad x = 0,1,2,\ldots, \ \lambda > 0$$

and zero elsewhere. Let $u = x_1 + \cdots + x_n$. Is $u$ sufficient for $\lambda$?


Solution 11.8. The joint probability function of $x_1,\ldots,x_n$ is given by

$$f(x_1,\ldots,x_n) = \frac{\lambda^{x_1+\cdots+x_n}e^{-n\lambda}}{x_1!\cdots x_n!}$$

and zero elsewhere. But the moment generating function (mgf) of $x_j$ is given by

$$M_{x_j}(t) = e^{\lambda(e^t - 1)}$$

which shows that the mgf of $u = x_1 + \cdots + x_n$ is that of a Poisson with parameter $n\lambda$, and the
probability function of $u$ is the following:

$$f_1(u) = \frac{(n\lambda)^u}{u!}e^{-n\lambda}, \quad u = 0,1,\ldots$$

and zero elsewhere. Note that the joint probability function of $x_1,\ldots,x_n$ and $u$ is the
same as the joint probability function of $x_1,\ldots,x_n$ since $u$ is a function of $x_1,\ldots,x_n$ and
not an independent variable. The conditional probability function of $x_1,\ldots,x_n$, given
$u$, is then

$$\begin{aligned}
g(x_1,\ldots,x_n|u) &= \frac{f(x_1,\ldots,x_n)}{f_1(u)} \quad \text{for } f_1(u) \ne 0 \\
&= \frac{\lambda^ue^{-n\lambda}}{x_1!\cdots x_n!}\cdot\frac{u!}{(n\lambda)^ue^{-n\lambda}} \\
&= \frac{m!}{x_1!\cdots x_n!}\left(\frac{1}{n}\right)^m \quad \text{for } u = m
\end{aligned} \tag{11.8}$$

which is free of $\lambda$, and hence $u = x_1 + \cdots + x_n$ is sufficient for $\lambda$.


Note 11.1. Note that for a given $u = m$ the last part of (11.8) is a multinomial probability
function with $p_1 = \cdots = p_n = \frac{1}{n}$ and $x_1 + \cdots + x_n = m$. Hence the conditional
probability function is a multinomial probability law here.
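
The statement that (11.8) is free of $\lambda$ can be checked directly by computing the conditional probability for two different values of $\lambda$. The following is a minimal Python sketch; the sample point and the two parameter values are illustrative, and the function names are hypothetical.

```python
import math

def poisson_pmf(x, lam):
    """Poisson probability function lam^x e^{-lam} / x!."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

def conditional_given_sum(xs, lam):
    """Pr{x_1, ..., x_n | u = sum(xs)} for iid Poisson(lam) observations."""
    n, m = len(xs), sum(xs)
    joint = math.prod(poisson_pmf(x, lam) for x in xs)
    return joint / poisson_pmf(m, n * lam)  # u is Poisson with parameter n*lam

def multinomial_equal_cells(xs):
    """m! / (x_1! ... x_n!) * (1/n)^m, the last line of (11.8)."""
    n, m = len(xs), sum(xs)
    coef = math.factorial(m) // math.prod(math.factorial(x) for x in xs)
    return coef * (1.0 / n) ** m

xs = [2, 0, 3, 1]  # illustrative sample point
print(conditional_given_sum(xs, 0.5),
      conditional_given_sum(xs, 4.0),
      multinomial_equal_cells(xs))
```

The three printed numbers should agree up to rounding, whatever values of $\lambda$ are used.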


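Example 11.9. Consider a simple random sample $x_1,\ldots,x_n$ of size $n$ from a $N(\mu,\sigma^2)$
population. Show that the sample mean $\bar{x}$ and $\sum_{j=1}^{n}(x_j - \bar{x})^2$ are jointly sufficient for
the population parameters $\mu$ and $\sigma^2$.


Solution 11.9. When the population is $N(\mu,\sigma^2)$, we know from Chapter 10 that $u_1 =
\sum_{j=1}^{n}(x_j - \bar{x})^2$ and $u_2 = \bar{x}$ are independently distributed. Hence the joint density of $u_1$ and
$u_2$ is the product of the marginal densities, denoted by $f_1(u_1)$ and $f_2(u_2)$, respectively.
Further, we know that $u_2 = \bar{x} \sim N(\mu,\frac{\sigma^2}{n})$. Hence

$$f_2(u_2) = \frac{\sqrt{n}}{\sigma\sqrt{2\pi}}e^{-\frac{n}{2\sigma^2}(\bar{x} - \mu)^2}.$$

We also know from Chapter 10 that

$$u = \sum_{j=1}^{n}\frac{(x_j - \bar{x})^2}{\sigma^2} \sim \chi^2_{n-1}$$

and free of all parameters. Put $u_1 = \sigma^2u \Rightarrow du = \frac{1}{\sigma^2}du_1$. Hence the density of $u_1$, denoted
by $f_1(u_1)$, is the following:

$$\begin{aligned}
f_1(u_1)du_1 &= \frac{u_1^{\frac{n-1}{2}-1}e^{-\frac{u_1}{2\sigma^2}}}{\sigma^{n-1}2^{\frac{n-1}{2}}\Gamma(\frac{n-1}{2})}du_1 \\
&= \frac{[\sum_{j=1}^{n}(x_j - \bar{x})^2]^{\frac{n-1}{2}-1}}{\sigma^{n-1}2^{\frac{n-1}{2}}\Gamma(\frac{n-1}{2})}e^{-\frac{1}{2\sigma^2}\sum_{j=1}^{n}(x_j - \bar{x})^2}du_1.
\end{aligned}$$

The joint density of $u_1$ and $u_2$ is $f_1(u_1)f_2(u_2)$. The joint density of $x_1,\ldots,x_n$ and $u_1$ and
$u_2$ is the same as the joint density of $x_1,\ldots,x_n$, which has the form of the numerator in (11.6)
with $\sigma^2$ now present, namely $(\sigma\sqrt{2\pi})^{-n}\exp\{-\frac{1}{2\sigma^2}[\sum_{j=1}^{n}(x_j - \bar{x})^2 + n(\bar{x} - \mu)^2]\}$. Hence
dividing by $f_1(u_1)f_2(u_2)$ we obtain the conditional density of $x_1,\ldots,x_n$, given $u_1 =
a$, $u_2 = b$, denoted by $g(x_1,\ldots,x_n|u_1 = a, u_2 = b)$, which is given by the following. Note that the
exponential part will be canceled.

$$g(x_1,\ldots,x_n|u_1 = a, u_2 = b) = \frac{\Gamma(\frac{n-1}{2})}{\sqrt{n}\,\pi^{\frac{n-1}{2}}a^{\frac{n-1}{2}-1}}$$

which is free of the parameters $\mu$ and $\sigma^2$. Hence $u_1 = \sum_{j=1}^{n}(x_j - \bar{x})^2$ and $u_2 = \bar{x}$ are jointly
sufficient for $\mu$ and $\sigma^2$.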


Note 11.2. Another concept associated with sufficiency is minimal sufficiency.
There may be different sets of statistics which are jointly sufficient for a set of parameters
$\theta_1,\ldots,\theta_r$. The first set of statistics may contain $s_1$ statistics, the second set
may contain $s_2$ statistics and so on. Which set should we take in such a situation?
One may be tempted to go for the set containing the minimum number of statistics.
But observe that our aim of selecting a sufficient statistic is the reduction in data.
Hence that set which allows the maximal reduction of data is the preferable set
and such a set which allows the maximal reduction of data is called the set of minimal
sufficient statistics. But we will not go further into the properties of sufficient
statistics.


Before concluding this section, we will introduce a factorization theorem given 
by Neyman and it is known as Neyman factorization theorem. The students are likely 
to misuse this factorization theorem when checking for sufficiency by using this fac- 
torization, and hence I have postponed the discussion to the end. The students are 
advised to use the definition of conditional distributions rather than going for the fac- 
torization theorem. 


Result 11.3 (Neyman factorization theorem). Let $x_1,\ldots,x_n$ be a sample coming from
some population and let the joint probability/density function of $x_1,\ldots,x_n$ be denoted
by $f(x_1,\ldots,x_n,\theta)$, where $\theta$ stands for all the parameters in the population. Let the
support or the range of the variables with non-zero probability/density function be
free of the parameters $\theta$. Let $T = T(x_1,\ldots,x_n)$ be a statistic or an observable function
of the sample values. (The sample need not be a simple random sample). If the joint
probability/density function allows a factorization of the form

$$f(x_1,\ldots,x_n,\theta) = f_1(T,\theta)f_2(x_1,\ldots,x_n) \tag{11.9}$$

where $f_1(T,\theta)$ is a function of $T$ and $\theta$ alone and $f_2(x_1,\ldots,x_n)$ is free of all parameters
$\theta$ in the parameter space $\Omega$, then $T$ is called a sufficient statistic for $\theta$.


The proof is beyond the scope of this book, and hence deleted. The students are
advised to use the rule only in populations such as Gaussian where $-\infty < x < \infty$ or
gamma where $0 \le x < \infty$ etc., where the range does not depend on the parameters,
and not in situations such as uniform over $[a,b]$ where $a \le x \le b$ with $a$ and $b$ being
parameters.


Exercises 11.2 


11.2.1. Consider iid variables $x_1,\ldots,x_n$ distributed as uniform over $[a,b]$, $b > a$. (1)
Show that the largest order statistic is a consistent estimator for $b$; (2) Is the smallest
order statistic consistent for $a$? Prove your assertion.




11.2.2. Let f(x) = cx?, 0 x x «0 and zero elsewhere. (1) Compute c if f (x) is a density; 
(2) compute Príy; > 0.8} and Príy, x 0.1} where y, and y; are the smallest and largest 
order statistics for a sample of size 5 coming from the population in (1). 


11.2.3. Let f(x) = Br, O <x < 0 and zero elsewhere. For a sample of size 6 from this 
population, compute (1) Pr{y, = eh (2) Pr{y, > 26}, where y, and y, are the smallest 
and largest order statistics, respectively. 


11.2.4. Consider the same population as in Exercise 11.2.3. Is the largest order statistic
for a sample of size $n$ from this population (1) unbiased for $\theta$; (2) consistent for $\theta$?


11.2.5. Consider the function 


x^ 0«χ«θ 
fG02c40Q0-x) 0xxx20 
0, elsewhere. 


If f(x) is a density function then (1) compute c; (2) the density of (a) the smallest order 
statistic, (b) the largest order statistic, for a sample of size 3 from this population. 


11.2.6. For the population in Exercise 11.2.5 compute the probabilities (1) Pr{y, < ar 
(2) Prfy; = 30), where y, and y; are the smallest and largest order statistics, respec- 
tively. 


11.2.7. For the order statistics in Exercise 11.2.6 is $y_1$ or $y_2$ (1) unbiased for $\theta$; (2) consistent
for $\theta$?


11.2.8. Show that (11.7) is a density function for $\max\{x_1,\ldots,x_n\} = m$ = a given number.


11.2.9. Show that the conditional statement in (11.6) is in fact a density function for
$\bar{x} = m$ = a given number.


11.2.10. Let $x_1,\ldots,x_n$ be a sample from some population with all parameters denoted
by $\theta$. Let $u_1 = x_1,\ldots,u_n = x_n$ be $n$ statistics. Then show that $u_1,\ldots,u_n$ are jointly sufficient
for all the parameters $\theta$.


11.3 Methods of estimation 


There are many methods available in the literature for getting point estimates for pa- 
rameters of a given population. Some of the most frequently used methods will be 
described here. Usually one selects a particular method of estimation in a given situ- 
ation based on the following criteria: (1) Convenience and simplicity in the use of the 
particular method; (2) The estimators available through that method have some desirable
properties that you would like to have. Hence no particular method can be better
than another method or no method is the best method universally. Each method has
its own motivating factors in being selected as a method of estimation.


11.3.1 The method of moments 


From the weak law of large numbers in Chapter 9, we have seen that for iid variables
a sample moment converges to the corresponding population moment as the sample
size goes to infinity. Taking this property as the motivational factor, the method
of moments suggests equating the sample moments to the corresponding population
moments and by using such equations estimating the parameters involved in the population
moments. Take as many equations as necessary. Let $x_1,\ldots,x_n$ be a simple random
sample. Then the sample moments are given by $\sum_{j=1}^{n}\frac{x_j^r}{n} = m_r$, $r = 1,2,\ldots$. These
are the sample integer moments. If $r$ here is replaced by a general $h$ where $h$ could be
fractional or even complex, then we have the general sample moments. We are going
to take only integer moments. Let the corresponding population moments be denoted
by $\mu_r' = E[x^r]$ where $E$ denotes the expected value. Now the principle of method of
moments says to equate $m_r$ and $\mu_r'$. Naturally $\mu_r'$ will contain the parameters from the
population. That is, consider the equation

$$m_r = \mu_r', \quad r = 1,2,\ldots \tag{11.10}$$

Take as many equations as needed from (11.10) and estimate the parameters. Note that
(11.10) does not hold universally or for all values of the parameters. It holds only at the
estimated points. Hence when writing the right side in (11.10) replace the parameters
$\theta$ by $\hat{\theta}$, meaning the estimated point. We will use the same notation $\hat{\theta}$ to denote the
estimator (random variable) as well as the estimate (a number); the use will be clear
from the context.


Example 11.10. Estimate the parameters in (1) an exponential population, (2) gamma
population, (3) in a Bernoulli population, when there is a simple random sample
of size $n$. Construct the estimates when there is an observed sample $\{2,0,5\}$ in (1);
$\{5,1,0,2\}$ in (2); $\{1,0,0,0,1\}$ in (3).


Solution 11.10. (1) Let the population density be

$$f_1(x) = \frac{1}{\theta}e^{-\frac{x}{\theta}}, \quad x \ge 0, \ \theta > 0$$

and zero elsewhere. Then we know that $E(x) = \theta = \mu_1'$. The first sample moment $m_1 =
\sum_{j=1}^{n}\frac{x_j}{n} = \bar{x}$. Consider the equation (11.10) for $r = 1$ at the estimated point. That is, $\bar{x} = \hat{\theta}$
or the parameter $\theta$ here is estimated by $\bar{x}$ and $\bar{x}$ is the estimator here or the estimator
by using the method of moments. For the observed sample point $\{2,0,5\}$ the value of
$\bar{x} = \frac{1}{3}(2 + 0 + 5) = \frac{7}{3}$, which is the estimate by using the method of moments.


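(2) Here, the population is gamma with density function, denoted by $f_2(x)$, given by the
following:

$$f_2(x) = \frac{x^{\alpha-1}e^{-\frac{x}{\beta}}}{\beta^{\alpha}\Gamma(\alpha)}, \quad \alpha > 0, \ \beta > 0, \ x \ge 0$$

and zero elsewhere. We know that

$$E(x) = \alpha\beta, \quad E(x^2) = \alpha(\alpha+1)\beta^2 \quad \text{and} \quad m_1 = \bar{x}, \quad m_2 = \sum_{j=1}^{n}\frac{x_j^2}{n}.$$

We may take two equations from (11.10) for $r = 1,2$ and perhaps we may be able to
estimate the two parameters $\alpha$ and $\beta$ by using these two equations. Since the equations
are non-linear, there is no guarantee that two equations will enable us to solve for $\alpha$
and $\beta$. Let us try.

$$\bar{x} = \hat{\alpha}\hat{\beta} \tag{i}$$

$$\sum_{j=1}^{n}\frac{x_j^2}{n} = \hat{\alpha}(\hat{\alpha}+1)(\hat{\beta})^2. \tag{ii}$$

Expanding the right side of (ii), we have

$$[\hat{\alpha}\hat{\beta}]^2 + \hat{\alpha}[\hat{\beta}]^2 = \sum_{j=1}^{n}\frac{x_j^2}{n}$$

and hence, by using (i),

$$\hat{\alpha}[\hat{\beta}]^2 = \sum_{j=1}^{n}\frac{x_j^2}{n} - (\bar{x})^2 = \sum_{j=1}^{n}\frac{(x_j - \bar{x})^2}{n} = s^2. \tag{iii}$$

Hence from (i) and (iii), we have

$$\hat{\beta} = \frac{s^2}{\bar{x}} \quad \text{and from (i) and (iii)} \quad \hat{\alpha} = \frac{(\bar{x})^2}{s^2} \tag{iv}$$

where $s^2$ is the sample variance. The quantities given in (iv) are the estimators (random
variables). For the observed sample point $\{5,1,0,2\}$, we have

$$\bar{x} = \frac{1}{4}(5 + 1 + 0 + 2) = 2$$

and

$$s^2 = \frac{1}{4}[(5-2)^2 + (1-2)^2 + (0-2)^2 + (2-2)^2] = \frac{14}{4}.$$

Hence the estimates are $\hat{\beta} = \frac{14}{(4)(2)} = \frac{7}{4}$ and $\hat{\alpha} = 4\left(\frac{4}{14}\right) = \frac{8}{7}$.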


(3) Here, the population is Bernoulli with the probability function

$$f_3(x) = p^xq^{1-x}, \quad x = 0,1, \ 0 < p < 1, \ q = 1 - p$$

and zero elsewhere. Here, $E(x) = \sum_{x=0}^{1}xp^xq^{1-x} = p$. Perhaps one population moment is
sufficient to estimate $p$ since there is only one parameter. The first sample moment is
$m_1 = \bar{x} = \frac{1}{n}(x_1 + \cdots + x_n)$, which is actually the binomial proportion because the sample
sum in the Bernoulli case is the binomial random variable $x$ and then $\bar{x}$ becomes the
binomial proportion. Therefore, the estimator by the method of moments, also called
the moment estimator, is the binomial proportion $\hat{p} = \frac{x}{n}$ where $x$ is the binomial variable
with $n$ number of trials. For the observed sample point $\{1,0,0,0,1\}$, the sample
proportion is $\frac{1}{5}(1 + 0 + 0 + 0 + 1) = \frac{2}{5}$. Hence the estimate in this case or the moment
estimate is $\hat{p} = \frac{2}{5}$.
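
The three moment estimates of Solution 11.10 can be reproduced in a few lines. The following is a minimal Python sketch; the function names are illustrative and not from the text, and the sample variance uses the dividing factor $n$ as in the solution.

```python
def moment_estimate_exponential(sample):
    """theta-hat = sample mean, from m_1 = mu_1' = theta."""
    return sum(sample) / len(sample)

def moment_estimates_gamma(sample):
    """(alpha-hat, beta-hat) = (xbar^2 / s^2, s^2 / xbar), with divisor n in s^2."""
    n = len(sample)
    xbar = sum(sample) / n
    s2 = sum((x - xbar) ** 2 for x in sample) / n
    return xbar ** 2 / s2, s2 / xbar

def moment_estimate_bernoulli(sample):
    """p-hat = sample proportion of ones."""
    return sum(sample) / len(sample)

print(moment_estimate_exponential([2, 0, 5]))       # 7/3 as a decimal
print(moment_estimates_gamma([5, 1, 0, 2]))         # (8/7, 7/4) as decimals
print(moment_estimate_bernoulli([1, 0, 0, 0, 1]))   # 2/5 as a decimal
```

The gamma function fails when $\bar{x} = 0$ or $s^2 = 0$, which matches the remark in the solution that the two non-linear equations need not always be solvable.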


11.3.2 The method of maximum likelihood


Another method of estimation is the method of maximum likelihood. The likelihood 
function L is defined in Chapter 10 (Definition 10.2). It is the joint density/probability 
function of the sample values at an observed sample point. This procedure requires 
the maximization of the likelihood function, with respect to all the parameters there, 
and thus select the estimates. The motivation behind this method is to assign those 
values to the parameters where the likelihood or the joint density/probability function 
of getting that sample point is the maximum. 


Example 11.11. Obtain the maximum likelihood estimators of the parameters (1) p in 
a Bernoulli population; (2) 0 in an exponential population. 


Solution 11.11. (1) The probability function in a Bernoulli population is

$$f(x_j) = p^{x_j} q^{1-x_j}, \quad x_j = 0, 1,\ 0 < p < 1,\ q = 1 - p$$

and zero elsewhere. The likelihood function is

$$L = \prod_{j=1}^{n} p^{x_j} q^{1-x_j} = p^{\sum_{j=1}^{n} x_j}\, q^{\,n - \sum_{j=1}^{n} x_j} = p^{x} q^{n-x}, \quad x = \sum_{j=1}^{n} x_j.$$

Since $L$ to $\ln L$ is a one to one function, maximization of $L$ is the same as maximization of $\ln L$ and vice versa. In this case,

$$\ln L = x \ln p + (n - x)\ln(1 - p).$$


Here, the technique of calculus is applicable. Hence, differentiate $\ln L$ with respect to $p$, equate to zero and solve for $p$ to obtain the critical points.

$$\frac{d}{dp}\ln L = 0 \;\Rightarrow\; \frac{x}{p} - \frac{(n - x)}{1 - p} = 0 \tag{i}$$

$$\Rightarrow\; \hat{p} = \frac{x}{n}. \tag{ii}$$

Here, $x$ is the sample sum in the iid Bernoulli case, and hence $x$ is the binomial random variable and, therefore, the only critical point here is the binomial proportion. Since (i) does not hold for all $p$, the point where it holds is denoted by $\hat{p}$. Now, consider the second-order derivative.

$$\frac{d^2}{dp^2}\ln L = \left[-\frac{x}{p^2} - \frac{(n - x)}{(1 - p)^2}\right] < 0$$

for $p = \hat{p}$. Hence $\hat{p} = \frac{x}{n}$ corresponds to a maximum, and this $\hat{p}$ is the maximum likelihood estimator (MLE) of $p$ here. An observed value of $x$ gives an observed value of $\hat{p}$, which is then called a maximum likelihood estimate (MLE). We will use the same abbreviation MLE for the estimator as well as for the estimate. Similarly, a hat, that is, $\hat{\theta}$, will be used to denote the estimator as well as the estimate of $\theta$; the difference in use will be clear from the context.
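
The Bernoulli result can be checked numerically. The following is a minimal sketch in Python, not part of the original text: the function names are illustrative assumptions, and the brute-force grid search is used only as a sanity check on the closed form $\hat{p} = x/n$, not as a recommended estimation procedure.

```python
import math


def bernoulli_log_likelihood(p, sample):
    """ln L = x ln p + (n - x) ln(1 - p), where x is the sample sum."""
    n = len(sample)
    x = sum(sample)
    return x * math.log(p) + (n - x) * math.log(1.0 - p)


def bernoulli_mle(sample):
    """Closed-form MLE of p: the binomial proportion x / n."""
    return sum(sample) / len(sample)


def grid_argmax(sample, steps=9999):
    """Brute-force maximizer of ln L over an evenly spaced grid inside (0, 1)."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return max(grid, key=lambda p: bernoulli_log_likelihood(p, sample))


sample = [1, 0, 0, 0, 1]        # the observed sample point {1,0,0,0,1}
p_hat = bernoulli_mle(sample)   # closed form x / n
p_grid = grid_argmax(sample)    # numerical check; should agree closely with p_hat
print(p_hat, p_grid)
```

Both values should agree up to the grid spacing, which illustrates that the critical point found by calculus is indeed the maximizer.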


Note 11.3. If we take a sample of size 1 from a binomial population then the population probability function is

$$L_1 = f(x) = \binom{n}{x} p^{x} q^{n-x},$$

$x = 0, 1, \ldots, n$, $0 < p < 1$, $q = 1 - p$. Then the MLE of $p$ here is given by $\hat{p} = \frac{x}{n}$, the same as the one obtained in Solution 11.11 where the population is Bernoulli and a sample of size $n$ is available. But the likelihood function in the Bernoulli case is

$$L = p^{x} q^{n-x}, \quad x = 0, 1, \ldots, n,\ 0 < p < 1,\ q = 1 - p.$$

Observe that the number of combinations $\binom{n}{x}$ is missing here, but both $L$ and $L_1$ will lead to the same MLE for $p$.


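(2) Here the density function is given by

$$f(x) = \frac{1}{\theta} e^{-\frac{x}{\theta}}, \quad x \ge 0,\ \theta > 0$$

and zero elsewhere. Therefore,

$$L = \prod_{j=1}^{n} \frac{1}{\theta} e^{-\frac{x_j}{\theta}} = \frac{1}{\theta^{n}} e^{-\frac{1}{\theta}\sum_{j=1}^{n} x_j} \;\Rightarrow\; \ln L = -n \ln\theta - \frac{1}{\theta}\sum_{j=1}^{n} x_j.$$
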


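By using calculus, we differentiate and equate to zero to obtain the critical points.

$$\frac{d}{d\theta}\ln L = -\frac{n}{\theta} + \frac{\sum_{j=1}^{n} x_j}{\theta^{2}} = 0 \tag{i}$$

$$\Rightarrow\; \hat{\theta} = \frac{1}{n}\sum_{j=1}^{n} x_j = \bar{x} = \text{sample mean} \tag{ii}$$
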


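Taking the second-order derivative

$$\frac{d^2}{d\theta^2}\ln L\,\Big|_{\theta = \bar{x}} = \left[\frac{n}{\theta^{2}} - \frac{2n\bar{x}}{\theta^{3}}\right]_{\theta = \bar{x}} = \frac{n}{\bar{x}^{2}}[1 - 2] = -\frac{n}{\bar{x}^{2}} < 0.$$

Hence the only critical point $\hat{\theta} = \bar{x}$ corresponds to a maximum, or $\bar{x}$ is the MLE of $\theta$ here.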
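
As an illustration of the two calculus steps for the exponential case, the sketch below (an editorial example in Python; the data values and function names are hypothetical) evaluates the first and second derivatives of $\ln L$ at $\theta = \bar{x}$. The first should be numerically zero and the second should equal $-n/\bar{x}^2$.

```python
import math


def exp_log_likelihood(theta, xs):
    """ln L = -n ln(theta) - (1/theta) * sum(x_j)."""
    return -len(xs) * math.log(theta) - sum(xs) / theta


def exp_score(theta, xs):
    """First derivative of ln L with respect to theta."""
    return -len(xs) / theta + sum(xs) / theta ** 2


def exp_second_derivative(theta, xs):
    """Second derivative of ln L with respect to theta."""
    return len(xs) / theta ** 2 - 2.0 * sum(xs) / theta ** 3


xs = [2.3, 0.7, 4.1, 1.5, 3.4]   # hypothetical observations, for illustration only
theta_hat = sum(xs) / len(xs)     # MLE: the sample mean
print(theta_hat, exp_score(theta_hat, xs), exp_second_derivative(theta_hat, xs))
```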


Example 11.12. Evaluate the MLE of $\mu$ and $\sigma^2$ in a $N(\mu, \sigma^2)$ population, assuming that a simple random sample of size $n$ is available from it.


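Solution 11.12. The likelihood function in this case is the following:

$$L = \prod_{j=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2\sigma^2}(x_j - \mu)^2} = \frac{1}{\sigma^{n}(\sqrt{2\pi})^{n}}\, e^{-\frac{1}{2\sigma^2}\sum_{j=1}^{n}(x_j - \mu)^2}.$$

Then

$$\ln L = -\frac{n}{2}\ln\sigma^2 - n\ln\sqrt{2\pi} - \frac{1}{2\sigma^2}\sum_{j=1}^{n}(x_j - \mu)^2.$$

Here, there are two parameters $\mu$ and $\theta = \sigma^2$. We can apply calculus here. Consider the partial derivatives of $\ln L$, with respect to $\mu$ and $\theta = \sigma^2$, and equate to zero to obtain the critical points.

$$\frac{\partial}{\partial\mu}\ln L = 0 \;\Rightarrow\; \frac{1}{\theta}\sum_{j=1}^{n}(x_j - \mu) = 0 \;\Rightarrow\; \hat{\mu} = \bar{x} \tag{i}$$

$$\frac{\partial}{\partial\theta}\ln L = 0 \;\Rightarrow\; -\frac{n}{2\hat{\theta}} + \frac{1}{2\hat{\theta}^{2}}\sum_{j=1}^{n}(x_j - \bar{x})^2 = 0 \;\Rightarrow\; \hat{\theta} = \hat{\sigma}^2 = \frac{1}{n}\sum_{j=1}^{n}(x_j - \bar{x})^2 = s^2 \tag{ii}$$

where $s^2$ is the sample variance. Hence there is only one critical point $(\mu, \sigma^2) = (\bar{x}, s^2)$.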
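Taking all the second-order derivatives we have the following:

$$\frac{\partial^2}{\partial\mu^2}\ln L = -\frac{n}{\theta}$$

$$\frac{\partial^2}{\partial\theta\,\partial\mu}\ln L\,\Big|_{\mu = \bar{x},\,\theta = s^2} = -\frac{1}{\theta^{2}}\sum_{j=1}^{n}(x_j - \bar{x}) = 0$$

$$\frac{\partial^2}{\partial\theta^2}\ln L\,\Big|_{\mu = \bar{x},\,\theta = s^2} = \left[\frac{n}{2\theta^{2}} - \frac{1}{\theta^{3}}\sum_{j=1}^{n}(x_j - \mu)^2\right]_{\mu = \bar{x},\,\theta = s^2} = \frac{n}{(s^2)^2}\left[\frac{1}{2} - 1\right] = -\frac{n}{2(s^2)^2} < 0.$$

Hence the matrix of second-order derivatives, evaluated at the critical point $(\mu, \theta) = (\bar{x}, s^2)$, is the following:

$$\begin{bmatrix} -\frac{n}{s^2} & 0 \\ 0 & -\frac{n}{2(s^2)^2} \end{bmatrix}$$

which is negative definite and hence the critical point $(\mu, \theta) = (\bar{x}, s^2)$ corresponds to a maximum.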
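
The normal case can be illustrated in the same spirit. This is a minimal sketch, assuming hypothetical data and illustrative function names: it computes $(\bar{x}, s^2)$ with divisor $n$, builds the diagonal matrix of second-order derivatives at that point, and compares $\ln L$ at the critical point with nearby parameter values.

```python
import math


def normal_mle(xs):
    """Return (x_bar, s^2), where s^2 uses the divisor n as in the text."""
    n = len(xs)
    mean = sum(xs) / n
    s2 = sum((x - mean) ** 2 for x in xs) / n
    return mean, s2


def normal_log_likelihood(mu, theta, xs):
    """ln L for N(mu, theta), with theta = sigma^2."""
    n = len(xs)
    ss = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(theta) - 0.5 * n * math.log(2.0 * math.pi) - ss / (2.0 * theta)


def hessian_at_mle(xs):
    """Matrix of second-order derivatives of ln L at (x_bar, s^2)."""
    n = len(xs)
    _, s2 = normal_mle(xs)
    return [[-n / s2, 0.0], [0.0, -n / (2.0 * s2 ** 2)]]


xs = [4.0, 7.0, 5.0, 8.0]         # hypothetical observations
mu_hat, theta_hat = normal_mle(xs)
H = hessian_at_mle(xs)
print(mu_hat, theta_hat, H)
```

Because the matrix is diagonal with negative diagonal entries, negative definiteness can be read off directly.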


Note 11.4. Instead of computing all second-order derivatives and checking for the negative definiteness of the matrix of second-order derivatives, evaluated at the critical point, one can also use the following argument in this case. Examine $\ln L$ for all possible values of $\mu$, $-\infty < \mu < \infty$, and $\theta = \sigma^2 > 0$. We see that $\ln L$ goes from $-\infty$ to $-\infty$ through some finite values. Hence the only critical point must correspond to a maximum because $\ln L$ does not stay at $-\infty$ all the time.


Note 11.5. The students are likely to ask the following question: If we had differentiated with respect to $\sigma$ instead of $\theta = \sigma^2$, would we have ended up with the same MLE $\hat{\mu} = \bar{x}$ and $\hat{\sigma}^2 = s^2$? The answer is in the affirmative and the student may verify this by treating $\sigma$ as the parameter and going through the same steps. This comes from a general result on differentiation. For any function $g(\theta)$, we have

$$\frac{d}{d\theta}g(\theta) = \left[\frac{d}{dh(\theta)}g(\theta)\right]\frac{d}{d\theta}h(\theta) \tag{i}$$

and hence as long as $\frac{d}{d\theta}h(\theta) \ne 0$ we have

$$\frac{d}{d\theta}g(\theta) = 0 \;\Rightarrow\; \frac{d}{dh(\theta)}g(\theta) = 0 \tag{ii}$$

and vice versa. Hence both the procedures should arrive at the same result as long as $h(\theta)$ is not a trivial function of $\theta$. The following is a general result.


Result 11.4. If $\hat{\theta}$ is the MLE of $\theta$ and if $h(\theta)$ is a non-trivial function of $\theta$, that is, $\frac{d}{d\theta}h(\theta) \ne 0$, then $h(\hat{\theta})$ is the MLE of $h(\theta)$.


Example 11.13. Evaluate the MLE of $a$ and $b$ in a uniform population over $[a, b]$, $a < b$, assuming that a simple random sample of size $n$ is available.




Solution 11.13. The likelihood function in this case is the following:

$$L = \frac{1}{(b - a)^{n}}, \quad a \le x_j \le b,\ j = 1, \ldots, n$$

and zero elsewhere. Note that the method of calculus fails here. Hence we may use other arguments. Observe that $a \le x_j \le b$ for each $j$. Maximum of $\frac{1}{(b - a)^{n}}$ means the minimum of $(b - a)^{n}$, which means the minimum possible value that can be assigned for $b - a$. When we substitute $\hat{b}$ = a function of $x_1, \ldots, x_n$ and $\hat{a}$ = a function of $x_1, \ldots, x_n$ we should get the minimum possible value for $\hat{b} - \hat{a}$. This means that we should assign the minimum possible value for $b$ and the maximum possible value for $a$. Since all observations are greater than or equal to $a$, the maximum possible value that can be assigned to $a$ is the smallest of the observations. Similarly, the smallest possible value that can be assigned to $b$ is the largest of the observations. Then the MLE are

$$\hat{a} = x_{n:1} = \text{smallest order statistic}, \quad \hat{b} = x_{n:n} = \text{largest order statistic}. \tag{11.11}$$

Note 11.6. If the uniform density is written as $f(x) = \frac{1}{b - a}$, $a < x < b$, what are the MLE of $a$ and $b$? If $a < x < b$, that is, $x$ is in the open interval $(a, b)$, then there is no MLE for either $a$ or $b$ because the observations can never attain $a$ or $b$ and hence no value from the observations can be assigned to $a$ or $b$. If $a < x \le b$, then the MLE for $b$ is $\hat{b} = x_{n:n}$ = the largest order statistic but there is no MLE for $a$. If $a \le x < b$, then there is no MLE for $b$ but the MLE for $a$ is $\hat{a} = x_{n:1}$ = smallest order statistic.
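
The order-statistic argument for the closed interval $[a, b]$ can be sketched as follows. This is an editorial illustration with hypothetical data and function names: the likelihood is evaluated directly, so one can see that widening the interval lowers $L$ while shrinking it past an observation makes $L$ zero.

```python
def uniform_likelihood(a, b, xs):
    """L = (b - a)^(-n) if every observation lies in [a, b], and zero otherwise."""
    if b <= a or min(xs) < a or max(xs) > b:
        return 0.0
    return (b - a) ** (-len(xs))


def uniform_mle(xs):
    """MLE for a uniform population over [a, b]: smallest and largest order statistics."""
    return min(xs), max(xs)


xs = [2.5, 3.1, 4.8, 3.9]          # hypothetical observations
a_hat, b_hat = uniform_mle(xs)
print(a_hat, b_hat, uniform_likelihood(a_hat, b_hat, xs))
```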


Example 11.14. Consider an exponential density with scale parameter $\theta$ and location parameter $\gamma$. That is, the density is of the form

$$f(x) = \frac{1}{\theta}\, e^{-\frac{(x - \gamma)}{\theta}}, \quad \theta > 0,\ x \ge \gamma$$

and zero elsewhere. Evaluate the MLE of $\theta$ and $\gamma$, if they exist.


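Solution 11.14. The likelihood function in this case is the following:

$$L = \frac{1}{\theta^{n}}\, e^{-\frac{1}{\theta}\sum_{j=1}^{n}(x_j - \gamma)}, \quad x_j \ge \gamma,\ j = 1, \ldots, n,\ \theta > 0.$$

By using calculus, we can obtain the MLE of $\theta$ for a fixed $\gamma$, as done before, as $\hat{\theta} = \bar{x} - \gamma$, but calculus fails to obtain the MLE of $\gamma$. We may use the following argument: For any fixed $\theta$, maximum of $L$ means the minimum possible value for $\sum_{j=1}^{n}(x_j - \gamma)$, which means the maximum possible value that can be assigned to $\gamma$ because all observations $x_1, \ldots, x_n$ are fixed. Since each observation $x_j \ge \gamma$, the maximum possible value that can be assigned to $\gamma$ is the smallest of the observations, or the MLE of $\gamma$ is $\hat{\gamma} = x_{n:1}$ = smallest order statistic. Substituting this value, the MLE of $\theta$ becomes $\hat{\theta} = \bar{x} - x_{n:1}$.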
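
A small numerical sketch of this two-parameter case follows. It is an editorial illustration under stated assumptions: the data are hypothetical, the function names are invented for the example, and the scale estimate is taken as $\bar{x} - \hat{\gamma}$, which is what setting $\frac{\partial}{\partial\theta}\ln L = 0$ gives when $\gamma$ is replaced by $\hat{\gamma} = x_{n:1}$.

```python
import math


def shifted_exp_log_likelihood(theta, gamma, xs):
    """ln L = -n ln(theta) - (1/theta) * sum(x_j - gamma), valid when all x_j >= gamma."""
    if theta <= 0 or min(xs) < gamma:
        return -math.inf          # L is zero outside the admissible region
    return -len(xs) * math.log(theta) - sum(x - gamma for x in xs) / theta


def shifted_exp_mle(xs):
    """gamma_hat is the smallest observation; theta_hat is x_bar minus gamma_hat."""
    gamma_hat = min(xs)
    theta_hat = sum(xs) / len(xs) - gamma_hat
    return theta_hat, gamma_hat


xs = [3.2, 5.0, 4.1, 7.7]          # hypothetical observations
theta_hat, gamma_hat = shifted_exp_mle(xs)
print(theta_hat, gamma_hat)
```

Moving $\gamma$ below the smallest observation lowers $\ln L$, and moving it above makes the likelihood zero, which is why calculus is not needed for $\gamma$.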

There are several large sample properties for maximum likelihood estimators, 
which we will consider later on, after introducing some more methods of estimation. 




Note 11.7. The student is likely to ask the question: if the likelihood function $L(\theta)$ has several local maxima then what is to be done? Then we take $\hat{\theta}$ = MLE as that point corresponding to the largest of the maxima, or $\sup_{\theta \in \Omega} L(\theta)$, the supremum.


Another method of estimation is applicable in situations where the data are already classified into a number of groups and the only available information is of the type that $n_1$ observations are in the first group, $n_2$ observations are in the second group, or $n_i$ observations are in the $i$-th group, $i = 1, 2, \ldots, k$ for a given $k$. An example may be of the following type: One may be observing the life-times of the same type of a machine component. $n_1$ may be the number of components which lasted 0 to 10 hours, $n_2$ may be the number of components which lasted between 10 and 20 hours and so on, and $n_{10}$ may be the number of components which lasted between 90 and 100 hours. Let $t$ denote the life-time; then the intervals are $0 \le t < 10, \ldots, 90 \le t < 100$. Suppose that there is a total of $n_1 + \cdots + n_k = n$ observations, say, for example, $n = 50$ components in the above illustration with $n_1 = 1$, $n_2 = 5$, $n_3 = 3$, $n_4 = 6$, $n_5 = 5$, $n_6 = 6$, $n_7 = 6$, $n_8 = 5$, $n_9 = 10$, $n_{10} = 3$. If the life-time is exponentially distributed with density $f(t) = \frac{1}{\theta}e^{-t/\theta}$, $t \ge 0$, then the true probability of finding a life-time between 0 and 10, denoted by $p_1 = p_1(\theta)$, is given by

$$p_1 = p_1(\theta) = \int_{0}^{10} \frac{1}{\theta}\, e^{-\frac{t}{\theta}}\, dt = 1 - e^{-\frac{10}{\theta}}.$$



In general, let $p_i(\theta)$ be the true probability of finding an observation in the $i$-th group. The actual number of observations or frequency in the $i$-th group is $n_i$ and the expected frequency is $np_i(\theta)$ where $n$ is the total frequency. We have a multinomial probability law with the true probabilities $p_1(\theta), \ldots, p_k(\theta)$ with $p_1(\theta) + \cdots + p_k(\theta) = 1$ and the frequencies as $n_1, \ldots, n_k$ with $n_1 + \cdots + n_k = n$. The probability function, denoted by $f(n_1, \ldots, n_k)$, is the following:

$$f(n_1, \ldots, n_k) = \frac{n!}{n_1! \cdots n_k!}\,[p_1(\theta)]^{n_1} \cdots [p_k(\theta)]^{n_k}$$

with $p_i(\theta) \ge 0$, $i = 1, \ldots, k$, $p_1(\theta) + \cdots + p_k(\theta) = 1$, for all $\theta \in \Omega$ (parameter space), $n_i = 0, 1, \ldots, n$, $n_1 + \cdots + n_k = n$, or it is a $(k - 1)$-variate multinomial law. We have the vector of true probabilities $P' = (p_1(\theta), \ldots, p_k(\theta))$ and we have the corresponding relative frequencies $Q' = (\frac{n_1}{n}, \ldots, \frac{n_k}{n})$, where a prime denotes transpose. A good method of estimation of $\theta$ is to minimize a distance between the two vectors $P$ and $Q$. A measure of generalized distance between $P$ and $Q$ is Karl Pearson's $X^2$ statistic, which is given by

$$X^2 = \sum_{i=1}^{k} \left[\frac{(n_i - np_i(\theta))^2}{np_i(\theta)}\right] \approx \chi^2_{k-s-1} \tag{11.12}$$

which can be shown to be approximately a chi-square with $k - s - 1$ degrees of freedom under some conditions on $n$, $k$, $p_1(\theta), \ldots, p_k(\theta)$.




Note 11.8. The quantity in (11.12) is only approximately a chi-square under some conditions such as $np_i(\theta) \ge 5$ for all $\theta$ and for each $i = 1, \ldots, k$, and $k$ itself is $\ge 5$. When $k = 2$ note that

$$X^2 = \frac{(n_1 - np_1)^2}{np_1} + \frac{(n_2 - np_2)^2}{np_2}, \quad p_1 + p_2 = 1,\ n_1 + n_2 = n.$$

Let $p_1 = p$, then $p_2 = 1 - p = q$ and $n_2 = n - n_1$. Then

$$\frac{(n_2 - np_2)^2}{np_2} = \frac{(n - n_1 - n(1 - p))^2}{nq} = \frac{(n_1 - np)^2}{nq}.$$

Then

$$X^2 = \frac{(n_1 - np)^2}{n}\left[\frac{1}{p} + \frac{1}{q}\right] = \frac{(n_1 - np)^2}{npq} \tag{11.13}$$

which is the square of a standardized binomial random variable $n_1$, having a good normal approximation for $n$ as small as 20 provided $p$ is near $\frac{1}{2}$. Hence $X^2$ will be approximately a $\chi^2_1$ for $n \ge 20$, $np \ge 5$, $nq \ge 5$. For $k = 3$ and $k = 4$ also, for large values of $n$, one can have a good approximation to chi-square provided no $p_i$ is near to zero or one. Students must compute the exact probabilities by using a multinomial law and compare with the chi-square approximation to realize that strict conditions on $n$ and $np_i$ are needed to have a good approximation. The conditions $np_i \ge 5$, $k \ge 5$, $i = 1, \ldots, k$ will be a sufficient condition for a good approximation, in general, but when $p_i = p_i(\theta)$ is unknown then there is no way of checking this condition unless we have some information beforehand about the behavior of $p_i(\theta)$ for all $\theta$.


11.3.3 Method of minimum Pearson's $X^2$ statistic or minimum chi-square method


As mentioned above, this method is applicable in situations where one has categorized data in hand. The principle is to minimize $X^2$ over $\theta$ and estimate $\theta$. Here, $\theta$ represents all the parameters involved in computing the exact probabilities in the various classes. Then

$$\min_{\theta} \sum_{i=1}^{k} \frac{(n_i - np_i(\theta))^2}{np_i(\theta)} \;\Rightarrow\; \theta = \hat{\theta}. \tag{11.14}$$

From (11.14), observe that minimization of $X^2$ for $k = 2$ and $p_1(\theta) = \theta$ gives the estimate as $\hat{\theta} = \frac{n_1}{n}$, the binomial proportion.
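
The two-class case can be illustrated with a short sketch. The following Python code is an editorial example under stated assumptions: the counts $n_1 = 12$, $n = 50$ are hypothetical, the function names are invented, and a coarse grid search stands in for a proper minimizer. It also checks numerically that, for $k = 2$, Pearson's $X^2$ reduces to $(n_1 - np)^2/(npq)$.

```python
def pearson_x2(observed, probs):
    """Pearson's X^2 = sum over cells of (n_i - n p_i)^2 / (n p_i)."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, probs))


def x2_two_cells(n1, n, theta):
    """X^2 for k = 2 with p_1(theta) = theta and p_2(theta) = 1 - theta."""
    return pearson_x2([n1, n - n1], [theta, 1.0 - theta])


def minimum_chi_square_two_cells(n1, n, steps=999):
    """Grid search for the theta in (0, 1) that minimizes X^2."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return min(grid, key=lambda t: x2_two_cells(n1, n, t))


n1, n = 12, 50                       # hypothetical frequencies
theta_hat = minimum_chi_square_two_cells(n1, n)
print(theta_hat, n1 / n)
```

For more than two classes the same `pearson_x2` function can be minimized over $\theta$ once the cell probabilities $p_i(\theta)$ are specified.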

This Pearson's $X^2$ statistic is quite often misused in applications. The misuse comes from treating it as a chi-square without checking the conditions on $n$ and the $p_i$'s for the approximation to hold.




Estimation of parameters in a multinomial population can be done in a similar way as in the binomial case. If $(x_1, \ldots, x_k)$ has a $(k - 1)$-variate multinomial law then the probability function is the following:

$$f(x_1, \ldots, x_k) = \frac{n!}{x_1! \cdots x_k!}\, p_1^{x_1} \cdots p_k^{x_k},$$

where $p_i > 0$, $i = 1, \ldots, k$, $p_1 + \cdots + p_k = 1$, $x_i = 0, 1, \ldots, n$, $x_1 + \cdots + x_k = n$. Since the sum $x_1 + \cdots + x_k$ is fixed as $n$, one of the variables can be written in terms of the others, that is, for example, $x_k = n - x_1 - \cdots - x_{k-1}$, and $p_k = 1 - p_1 - \cdots - p_{k-1}$. We know that

$$E(x_i) = np_i,\ i = 1, \ldots, k \;\Rightarrow\; \hat{p}_i = \frac{x_i}{n},\ i = 1, \ldots, k \tag{11.15}$$

are the moment estimates. For computing the maximum likelihood estimates (MLE), either consider one observation $(x_1, \ldots, x_k)$ from the multinomial law or iid variables from the point Bernoulli multinomial trials. Take $\ln f(x_1, \ldots, x_k)$ and use calculus, that is, substitute $p_k = 1 - p_1 - \cdots - p_{k-1}$, differentiate partially with respect to $p_i$, $i = 1, \ldots, k - 1$, equate to zero and solve. This will lead to the MLE as $\hat{p}_i = \frac{x_i}{n}$, $i = 1, \ldots, k$, the same as the moment estimates. [This derivation and the illustration that the matrix of second-order derivatives, at this critical point $(p_1, \ldots, p_k) = (\frac{x_1}{n}, \ldots, \frac{x_k}{n})$, is negative definite, are left to the students. When trying to show negative definiteness remember that there are only $k - 1$ variables $x_1, \ldots, x_{k-1}$ and only $k - 1$ parameters $p_1, \ldots, p_{k-1}$.] On the other hand, if we consider point Bernoulli multinomial trials, then we have iid variables and the $i$-th variable takes the value $(x_{1i}, \ldots, x_{ki})$ where only one of the $x_{ji}$'s is 1 and the remaining are zeros for each of the $n$ trials, so that $x_j = \sum_{i=1}^{n} x_{ji}$, $j = 1, \ldots, k$, and the likelihood function becomes

$$L = p_1^{x_1} \cdots p_k^{x_k}$$

and note that the multinomial coefficient $\frac{n!}{x_1! \cdots x_k!}$ is absent. Now, proceed with the maximization of this $L$ and we will end up with the same estimates as above. Calculus can also be used here.


11.3.4 Minimum dispersion method 


This method, introduced by this author in 1967, is based on the principle of minimization of a measure of "dispersion" or "scatter" or "distance". Let $x_1, \ldots, x_n$ be a sample from a specified population with parameter $\theta$ to be estimated. Let $T = T(x_1, \ldots, x_n)$ be an arbitrary estimator for the parameter $\theta$. Then $E|T - \theta|$ is a measure of distance between $T$ and $\theta$. Similarly, $\{E|T - \theta|^{r}\}^{\frac{1}{r}}$, $r \ge 1$, is a measure of distance between $T$ and $\theta$ or a measure of dispersion or scatter in $T$ from the point of location $\theta$. If there exists a $T$ such that a pre-selected measure of scatter in $T$ from the point of location $\theta$ is a minimum, then such a $T$ is the minimum dispersion estimator. In Decision Theory, $|T - \theta|$ is called a "loss function" (the terminology in Decision Theory is not that proper because $(T - \theta)$ is understood by a layman to be loss or gain in using $T$ to estimate $\theta$, and then taking the expected loss as risk also is not that logical) and the expected value of the loss function is called the "risk" function, which, in fact, will be a one to one function of the dispersion in $T$ from the point of location $\theta$. Some general results associated with squared distances will be considered later, and hence further discussion is postponed.

For the next properties to be discussed, we need to recall a few results from earlier 
chapters. These will be stated here as lemmas without proofs. 


Lemma 11.1. Let $x$ be a real random variable and $\alpha$ be an arbitrary constant. Then that value of $\alpha$ for which $E[x - \alpha]^2$ is a minimum is $\alpha = E(x)$, and that value of $\beta$ for which $E|x - \beta|$ is a minimum is $\beta$ = the median of $x$. That is,

$$\min_{\alpha} E[x - \alpha]^2 \;\Rightarrow\; \alpha = E(x) \tag{11.16}$$

$$\min_{\beta} E|x - \beta| \;\Rightarrow\; \beta = \text{median of } x. \tag{11.17}$$


Lemma 11.2. Let $y$ and $x$ be real random variables and let $g(x)$ be an arbitrary function of $x$ at a given value of $x$. Then

$$\min_{g} E[y - g(x)]^2\,\Big|_{x\ \text{given}} \;\Rightarrow\; g(x) = E(y|x) \tag{11.18}$$

or $g(x = b)$ is the conditional expectation of $y$ at the given value of $x = b$, which will minimize the squared distance between $y$ and an arbitrary function of $x$ at a given $x$.


Lemma 11.3. 


$$E(y) = E_x[E(y|x)] \tag{11.19}$$


whenever all the expected values exist, where the inside expectation is taken in the 
conditional space of y given x, and the outside expectation is taken in the marginal 
space x. Once the inside expectation is taken, then the given value of x is replaced by 
the random variable x and then the expectation with respect to x is taken. This is the 
meaning of expectation of the conditional expectation. 


Lemma 11.4.

$$\mathrm{Var}(y) = E[\mathrm{Var}(y|x)] + \mathrm{Var}[E(y|x)] \tag{11.20}$$

whenever all the variances exist. That is, the variance of $y$ is the sum of the expected value of the conditional variance of $y$ given $x$ and the variance of the conditional expectation of $y$ given $x$.
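
Lemmas 11.3 and 11.4 can be verified exactly on a small discrete example. The sketch below is an editorial illustration: the joint probability function and the helper names are hypothetical, and exact rational arithmetic is used so that both identities hold without rounding error.

```python
from fractions import Fraction as F

# Hypothetical joint probability function of (x, y); the probabilities sum to 1.
joint = {(0, 1): F(1, 8), (0, 2): F(3, 8), (1, 1): F(1, 4), (1, 3): F(1, 4)}


def marginal_x(joint):
    """Marginal probability function of x."""
    px = {}
    for (x, _), pr in joint.items():
        px[x] = px.get(x, F(0)) + pr
    return px


def conditional_moment(joint, x, r):
    """E(y^r | x) at the given value of x."""
    px = marginal_x(joint)[x]
    return sum(pr * y ** r for (xv, y), pr in joint.items() if xv == x) / px


def moment_y(joint, r):
    """Unconditional moment E(y^r)."""
    return sum(pr * y ** r for (_, y), pr in joint.items())


px = marginal_x(joint)
cond_mean = {x: conditional_moment(joint, x, 1) for x in px}
cond_var = {x: conditional_moment(joint, x, 2) - cond_mean[x] ** 2 for x in px}

mean_y = moment_y(joint, 1)
var_y = moment_y(joint, 2) - mean_y ** 2

mean_of_cond_mean = sum(px[x] * cond_mean[x] for x in px)   # E_x[E(y|x)]
mean_of_cond_var = sum(px[x] * cond_var[x] for x in px)     # E[Var(y|x)]
var_of_cond_mean = sum(px[x] * cond_mean[x] ** 2 for x in px) - mean_of_cond_mean ** 2

print(mean_y, mean_of_cond_mean)
print(var_y, mean_of_cond_var + var_of_cond_mean)
```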




11.3.5 Some general properties of point estimators 


We have looked into the property of relative efficiency of estimators. We have also 
looked into minimum dispersion estimators and minimum risk estimators. Do such es- 
timators really exist? We can obtain some general properties which will produce some 
bounds for the distance of the estimator from the parameter for which the estimator is 
constructed. One such property is in terms of what is known as Fisher's information. 
Consider a simple random sample from a population with density/probability function $f(x, \theta)$. Then the joint density/probability function, denoted by $L = L(x_1, \ldots, x_n)$, is the following:

$$L = \prod_{j=1}^{n} f(x_j, \theta) \;\Rightarrow\; \ln L = \sum_{j=1}^{n} \ln f(x_j, \theta).$$

Let us differentiate partially with respect to $\theta$. For the time being, we will take $\theta$ as a real scalar parameter. Then we have

$$\frac{\partial}{\partial\theta}\ln L = \sum_{j=1}^{n} \frac{\partial}{\partial\theta}\ln f(x_j, \theta). \tag{11.21}$$

Since the total probability is 1, we have

$$\int \cdots \int L\, dx_1 \wedge \cdots \wedge dx_n = 1 \tag{11.22}$$

when the variables are continuous. Replace the integrals by sums when discrete. We will illustrate the results for the continuous case. The steps will be parallel for the discrete case. Let us differentiate both sides of (11.22) with respect to $\theta$. The right side gives zero. Can we differentiate the left side inside the integral sign? If the support of $f(x, \theta)$, that is, the interval where $f(x, \theta)$ is non-zero, contains the parameter $\theta$, such as a uniform density over $[0, \theta]$, then we cannot take the derivative inside the integral sign because the limits of integration will contain $\theta$ also. Hence the following procedure is not applicable in situations where the support of $f(x, \theta)$ depends on $\theta$. If we can differentiate inside the integral, then we have the following, by writing the multiple integral as $\int_X$ and the wedge product of differentials as $dx_1 \wedge \cdots \wedge dx_n = dX$, where $X' = (x_1, \ldots, x_n)$, prime denoting the transpose:

$$\int_X \frac{\partial L}{\partial\theta}\, dX = 0. \tag{11.23}$$

But note that $\frac{\partial L}{\partial\theta} = [\frac{\partial}{\partial\theta}\ln L]L$ so that we can write (11.23) as an expected value

$$\int_X \frac{\partial L}{\partial\theta}\, dX = 0 \;\Rightarrow\; \int_X \left[\frac{\partial}{\partial\theta}\ln L\right] L\, dX = 0 \;\Rightarrow\; E\left[\frac{\partial}{\partial\theta}\ln L\right] = 0. \tag{11.24}$$

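Then from (11.21),

$$E\left[\frac{\partial}{\partial\theta}\ln L\right] = \sum_{j=1}^{n} E\left[\frac{\partial}{\partial\theta}\ln f(x_j, \theta)\right] = nE\left[\frac{\partial}{\partial\theta}\ln f(x_j, \theta)\right] \tag{11.25}$$

since $x_1, \ldots, x_n$ are iid variables. Let $T = T(x_1, \ldots, x_n)$ be an estimator for $\theta$. If $T$ is unbiased for $\theta$ then $E(T) = \theta$, otherwise $E(T) = \theta + b(\theta)$ where $b(\theta)$ is the bias in using $T$ to estimate $\theta$. Writing it as an integral and then differentiating both sides with respect to $\theta$ and differentiating inside the integral sign, we have the following:

$$\int_X TL\, dX = \theta + b(\theta) \;\Rightarrow\; \int_X T\frac{\partial L}{\partial\theta}\, dX = 1 + b'(\theta),$$

$$\frac{\partial}{\partial\theta}(TL) = T\left(\frac{\partial L}{\partial\theta}\right) = T\left[\frac{\partial}{\partial\theta}\ln L\right]L. \tag{11.26}$$

Note that $T$ does not contain $\theta$. Writing as expected values, we have

$$\int_X T\left[\frac{\partial}{\partial\theta}\ln L\right] L\, dX = 1 + b'(\theta) \;\Rightarrow\; E\left[T\left(\frac{\partial}{\partial\theta}\ln L\right)\right] = 1 + b'(\theta).$$
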

For any two real random variables $u$ and $v$, we have the following results for covariances and correlations:

$$\mathrm{Cov}(u, v) = E[(u - E(u))(v - E(v))] = E[u(v - E(v))] = E[v(u - E(u))]$$

because the second terms will be zeros due to the fact that for any random variable $u$, $E[u - E(u)] = 0$. Further, since correlation, in absolute value, is $\le 1$, we have

$$[\mathrm{Cov}(u, v)]^2 \le \mathrm{Var}(u)\,\mathrm{Var}(v)$$

which is also the Cauchy–Schwarz inequality. Since $E[\frac{\partial}{\partial\theta}\ln L] = 0$ from (11.24) and since $E[T(\frac{\partial}{\partial\theta}\ln L)] = \mathrm{Cov}(T, \frac{\partial}{\partial\theta}\ln L)$, we have the following result, which holds under the following regularity conditions: (i) the support of $f(x_j, \theta)$ is free of $\theta$; (ii) $f(x_j, \theta)$ is differentiable with respect to $\theta$; (iii) $\int_{x_j} \frac{\partial}{\partial\theta} f(x_j, \theta)\, dx_j$ exists; (iv) $\mathrm{Var}(\frac{\partial}{\partial\theta}\ln f(x_j, \theta))$ is finite and positive for all $\theta \in \Omega$.


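Result 11.5 (Cramer-Rao inequality). For the quantities defined above,

$$\left[\mathrm{Cov}\left(T, \frac{\partial}{\partial\theta}\ln L\right)\right]^2 = [1 + b'(\theta)]^2 \le \mathrm{Var}(T)\,\mathrm{Var}\left(\frac{\partial}{\partial\theta}\ln L\right) \;\Rightarrow$$

$$\mathrm{Var}(T) \ge \frac{[1 + b'(\theta)]^2}{\mathrm{Var}(\frac{\partial}{\partial\theta}\ln L)} = \frac{[1 + b'(\theta)]^2}{n\,\mathrm{Var}(\frac{\partial}{\partial\theta}\ln f(x_j, \theta))} = \frac{[1 + b'(\theta)]^2}{E[\frac{\partial}{\partial\theta}\ln L]^2} = \frac{[1 + b'(\theta)]^2}{nE[\frac{\partial}{\partial\theta}\ln f(x_j, \theta)]^2} \tag{11.27}$$

which gives a lower bound for the variance of the estimator $T$.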


Note that since $x_1, \ldots, x_n$ are iid variables and since $E[\frac{\partial}{\partial\theta}\ln L] = 0$ and $E[\frac{\partial}{\partial\theta}\ln f(x_j, \theta)] = 0$ we have

$$\mathrm{Var}\left(\frac{\partial}{\partial\theta}\ln L\right) = E\left[\frac{\partial}{\partial\theta}\ln L\right]^2 = n\,\mathrm{Var}\left(\frac{\partial}{\partial\theta}\ln f(x_j, \theta)\right) = nE\left[\frac{\partial}{\partial\theta}\ln f(x_j, \theta)\right]^2. \tag{11.28}$$

The result in (11.27) is known as the Cramer-Rao inequality. If $T$ is unbiased for $\theta$, then $b(\theta) = 0 \Rightarrow b'(\theta) = 0$ and then

$$\mathrm{Var}(T) \ge \frac{1}{I_n(\theta)} = \frac{1}{nI_1(\theta)} \tag{11.29}$$

where $I_n(\theta) = \mathrm{Var}(\frac{\partial}{\partial\theta}\ln L)$, which has the various representations given in (11.28). This $I_n(\theta)$ is called Fisher's information about $\theta$ in the sample of size $n$ and $I_1(\theta)$ is Fisher's information in one observation. Hence the bounds for the variance of $T$, given in (11.27) and (11.29), are called information bounds for the variance of an estimator. Since $\mathrm{Var}(T)$ and $I_n(\theta)$ are inversely related, a smaller variance means a larger information content, which is also consistent with a layman's visualization of "information" about $\theta$ that is contained in the sample. A larger variance means less information and a smaller variance means more information, because the estimator $T$ is concentrated around $\theta$ when the variance is small.


Note 11.9. “Information” in “Information Theory” is different from Fisher's infor- 
mation given in (11.29). The information in Information Theory is a measure of lack 
of uncertainty in a given scheme of events and the corresponding probabilities. This 
lack of “information” is also called “entropy” and this is the measure appearing in 
Communication Theory, Engineering and Physics problems. Students who wish to 
know more about Information Theory may look into the book [14], a copy is avail- 
able in CMS library. 


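We can also obtain an alternative representation for $I_n(\theta)$. Let us consider the equation $E[\frac{\partial}{\partial\theta}\ln L] = 0$ and let us differentiate both sides with respect to $\theta$.

$$E\left[\frac{\partial}{\partial\theta}\ln L\right] = 0 \;\Rightarrow\; \int_X \left(\frac{\partial}{\partial\theta}\ln L\right) L\, dX = 0$$

$$\Rightarrow\; \int_X \frac{\partial}{\partial\theta}\left[\left(\frac{\partial}{\partial\theta}\ln L\right) L\right] dX = 0$$

$$\Rightarrow\; \int_X \left\{\left(\frac{\partial^2}{\partial\theta^2}\ln L\right) L + \left(\frac{\partial}{\partial\theta}\ln L\right)\frac{\partial L}{\partial\theta}\right\} dX = 0$$

$$\Rightarrow\; \int_X \left\{\left(\frac{\partial^2}{\partial\theta^2}\ln L\right) L + \left(\frac{\partial}{\partial\theta}\ln L\right)^2 L\right\} dX = 0$$

$$\Rightarrow\; \int_X \left(\frac{\partial}{\partial\theta}\ln L\right)^2 L\, dX = -\int_X \left(\frac{\partial^2}{\partial\theta^2}\ln L\right) L\, dX.$$

This shows that

$$\mathrm{Var}\left(\frac{\partial}{\partial\theta}\ln L\right) = E\left[\frac{\partial}{\partial\theta}\ln L\right]^2 = -E\left[\frac{\partial^2}{\partial\theta^2}\ln L\right] = nE\left[\frac{\partial}{\partial\theta}\ln f(x_j, \theta)\right]^2 = -nE\left[\frac{\partial^2}{\partial\theta^2}\ln f(x_j, \theta)\right]. \tag{11.30}$$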


This may be a more convenient formula when computing Fisher's information. 
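
The equality of the two representations can be checked exactly for a single Bernoulli observation, where the expectations are finite sums. The following sketch is an editorial illustration: the value $p = 3/10$ and the function names are hypothetical, and the discrete case is used because the text notes that the steps are parallel when integrals are replaced by sums.

```python
from fractions import Fraction


def score(x, p):
    """d/dp ln f(x, p) for f(x, p) = p^x (1 - p)^(1 - x), x in {0, 1}."""
    return Fraction(x) / p - Fraction(1 - x) / (1 - p)


def second_derivative(x, p):
    """d^2/dp^2 ln f(x, p) for the same Bernoulli probability function."""
    return -Fraction(x) / p ** 2 - Fraction(1 - x) / (1 - p) ** 2


def expect(g, p):
    """E[g(x, p)] when x takes the value 1 with probability p and 0 otherwise."""
    return (1 - p) * g(0, p) + p * g(1, p)


p = Fraction(3, 10)                                      # hypothetical parameter value
mean_score = expect(score, p)                            # should be zero, as in (11.24)
info_from_square = expect(lambda x, q: score(x, q) ** 2, p)
info_from_second = -expect(second_derivative, p)
print(mean_score, info_from_square, info_from_second)
```

Both representations give $I_1(p) = \frac{1}{p(1 - p)}$ for this population.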


Example 11.15. Check whether the minimum variance bound or information bound is attained for the moment estimator of the parameter (1) $\theta$ in an exponential population, (2) $p$ in a binomial population.


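Solution 11.15. (1) The exponential density is given by

$$f(x, \theta) = \frac{1}{\theta}\, e^{-\frac{x}{\theta}}, \quad \theta > 0,\ x \ge 0$$

and zero elsewhere. Hence

$$\ln f(x_j, \theta) = -\ln\theta - \frac{x_j}{\theta}$$

$$\Rightarrow\; \frac{\partial}{\partial\theta}\ln f(x_j, \theta) = -\frac{1}{\theta} + \frac{x_j}{\theta^2}$$

$$\frac{\partial^2}{\partial\theta^2}\ln f(x_j, \theta) = \frac{1}{\theta^2} - \frac{2x_j}{\theta^3}.$$

Hence

$$-nE\left[\frac{\partial^2}{\partial\theta^2}\ln f(x_j, \theta)\right] = -\frac{n}{\theta^2} + \frac{2nE(x_j)}{\theta^3} = -\frac{n}{\theta^2} + \frac{2n\theta}{\theta^3} = \frac{n}{\theta^2} = nI_1(\theta) = I_n(\theta).$$

The moment estimator of $\theta$ is $\bar{x}$ and

$$\mathrm{Var}(\bar{x}) = \frac{\mathrm{Var}(x)}{n} = \frac{\theta^2}{n} = \frac{1}{I_n(\theta)}.$$

Hence the information bound is attained or $\bar{x}$ is the minimum variance unbiased estimator (MVUE) for $\theta$ here.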
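(2) For the binomial case the moment estimator of $p$ is $\hat{p} = \frac{x}{n}$ and

$$\mathrm{Var}(\hat{p}) = \frac{\mathrm{Var}(x)}{n^2} = \frac{np(1 - p)}{n^2} = \frac{p(1 - p)}{n}.$$

The likelihood function for the Bernoulli population is

$$L = p^{x}(1 - p)^{n-x}, \quad x = 0, 1, \ldots, n.$$

[One can also take the binomial probability function as it is, taken as a sample of size 1 from a binomial population, namely $f(x) = \binom{n}{x}p^{x}(1 - p)^{n-x}$.]

$$\ln L = x\ln p + (n - x)\ln(1 - p)$$

$$\frac{d}{dp}\ln L = \frac{x}{p} - \frac{(n - x)}{1 - p}$$

$$\frac{d^2}{dp^2}\ln L = -\frac{x}{p^2} - \frac{(n - x)}{(1 - p)^2}$$

$$-E\left[\frac{d^2}{dp^2}\ln L\right] = \frac{E(x)}{p^2} + \frac{n - E(x)}{(1 - p)^2} = \frac{n}{p(1 - p)}$$

$$\Rightarrow\; \mathrm{Var}(\hat{p}) = \frac{1}{I_n(p)}.$$

Hence $\hat{p}$ here is the MVUE for $p$.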
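
The claim that the binomial proportion attains the information bound can be checked directly from the binomial probability function. The sketch below is an editorial illustration with hypothetical values $n = 20$, $p = 0.3$ and invented function names; it computes $\mathrm{Var}(\hat{p})$ by summing over all outcomes and compares it with $1/I_n(p) = p(1 - p)/n$.

```python
import math


def binomial_pmf(x, n, p):
    """P(x) = C(n, x) p^x (1 - p)^(n - x)."""
    return math.comb(n, x) * p ** x * (1.0 - p) ** (n - x)


def variance_of_proportion(n, p):
    """Var(x / n), computed from the binomial probability function."""
    mean = sum((x / n) * binomial_pmf(x, n, p) for x in range(n + 1))
    second = sum((x / n) ** 2 * binomial_pmf(x, n, p) for x in range(n + 1))
    return second - mean ** 2


def information_bound(n, p):
    """1 / I_n(p), with I_n(p) = n / (p (1 - p))."""
    return p * (1.0 - p) / n


n, p = 20, 0.3                      # hypothetical values
var_p_hat = variance_of_proportion(n, p)
bound = information_bound(n, p)
print(var_p_hat, bound)
```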


Exercises 11.3 


11.3.1. By using the method of moments estimate the parameters $\alpha$ and $\beta$ in (1) type-1 beta population with parameters $(\alpha, \beta)$; (2) type-2 beta population with the parameters $(\alpha, \beta)$.

11.3.2. Prove Lemmas 11.1 and 11.2. 
11.3.3. Prove Lemmas 11.3 and 11.4. 


11.3.4. Consider the populations (1) $f(x) = \alpha x^{\alpha - 1}$, $0 \le x \le 1$, $\alpha > 0$, and zero elsewhere; (2) $f(x) = \beta(1 - x)^{\beta - 1}$, $0 \le x \le 1$, $\beta > 0$ and zero elsewhere. Construct two unbiased estimators each for the parameters $\alpha$ and $\beta$ in (1) and (2).


11.3.5. Show that the sample mean is unbiased for (1) $\lambda$ in a Poisson population $f_1(x) = \frac{\lambda^{x}}{x!}e^{-\lambda}$, $x = 0, 1, 2, \ldots$, $\lambda > 0$ and zero elsewhere; (2) $\theta$ in the exponential population $f_2(x) = \frac{1}{\theta}e^{-x/\theta}$, $x \ge 0$, $\theta > 0$ and zero elsewhere.


11.3.6. For $\theta$ in a uniform population over $[0, \theta]$ construct an estimate $T = c_1x_1 + c_2x_2 + c_3x_3$, where $x_1, x_2, x_3$ are iid as uniform over $[0, \theta]$, such that $E(T) = \theta$. Find $c_1, c_2, c_3$ such that two unbiased estimators $T_1$ and $T_2$ are obtained where $T_1 = 2\bar{x}$, $\bar{x}$ = the sample mean. Compute $E[T_1 - \theta]^2$ and $E[T_2 - \theta]^2$. Which is relatively more efficient?


11.3.7. Consider a simple random sample of size $n$ from a Laplace density or double exponential density

$$f(x) = \frac{1}{2}\, e^{-|x - \theta|}, \quad -\infty < x < \infty,\ -\infty < \theta < \infty.$$

Evaluate the MLE of $\theta$. Is it MVUE of $\theta$?


11.3.8. For the population in Exercise 11.2.6, let $T_1$ be the moment estimator and $T_2$ be the MLE of $\theta$. Check to see whether $T_1$ or $T_2$ is relatively more efficient.


11.3.9. In Exercise 11.2.6, is the moment estimator (1) consistent, (2) sufficient, for $\theta$?


11.3.10. In Exercise 11.2.6, is the maximum likelihood estimator (1) consistent, (2) sufficient, for $\theta$?


11.3.11. Consider a uniform population over $[a, b]$, $b > a$. Construct the moment estimators for $a$ and $b$.


11.3.12. In Exercise 11.3.4, is the moment estimator or MLE relatively more efficient for 
(1) a when b is known, (2) b when a is known. 


11.3.13. In Exercise 11.3.4, are the moment estimators sufficient for the parameters in 
situations (1) and (2) there? 


11.3.14. In Exercise 11.3.4, are the MLE sufficient for the parameters in situations (1) 
and (2) there? 


11.4 Point estimation in the conditional space 


11.4.1 Bayes' estimates 


So far, we have been considering one given or pre-selected population having one or 
more parameters which are fixed but unknown constants. We were trying to give point 
estimates based on a simple random sample of size n from this pre-selected popula- 
tion with fixed parameters. Now we consider the problem of estimating one variable 
by observing or preassigning another variable. There are different types of topics un- 
der this general procedure. General model building problems fall in this category of 
estimating or predicting one or more variables by observing or preassigning one or 
more other variables. We will start with a Bayesian type problem first. 

Usual Bayesian analysis in the simplest situation is stated in terms of one variable, and one parameter having its own distribution. Let $x$ have a density/probability function for a fixed value of the parameter $\theta$. If $\theta$ is likely to have its own distribution, then we denote the density of $x$ as a conditional statement, $f(x|\theta)$, or the density/probability function of $x$ at a given $\theta$. If $\theta$ has a density/probability function of its own, denoted by $g(\theta)$, then the joint density/probability function of $x$ and $\theta$, denoted by $f(x, \theta)$, is the following:

$$f(x, \theta) = f(x|\theta)g(\theta).$$


We may use the following technical terms: $g(\theta)$ as the prior density/probability function of $\theta$ and $f(x|\theta)$ as the conditional density/probability function of $x$, at a preassigned value of $\theta$. Then the unconditional density/probability function of $x$, denoted by $f_1(x)$, is available by integrating out (or summing up in the discrete case) the variable $\theta$ from the joint density/probability function. That is,

$$f_1(x) = \int_{\theta} f(x|\theta)g(\theta)\, d\theta \quad \text{when } \theta \text{ is continuous} \tag{11.31}$$

$$\phantom{f_1(x)} = \sum_{\theta} f(x|\theta)g(\theta) \quad \text{when } \theta \text{ is discrete.} \tag{11.32}$$

Then the conditional density of $\theta$ at a given value of $x$, denoted by $g(\theta|x)$, which may also be called the posterior density/probability function of $\theta$ as opposed to the prior density/probability function of $\theta$, is the following:

$$g(\theta|x) = \frac{f(x, \theta)}{f_1(x)} = \frac{f(x|\theta)g(\theta)}{\int_{\theta} f(x|\theta)g(\theta)\, d\theta} \tag{11.33}$$

and replace the integral by the sum in the discrete case.

If we are planning to estimate or predict $\theta$ by using $x$ or at a preassigned value of $x$, then from Lemma 11.2 we see that the "best" estimate or best predictor of $\theta$, given $x$, is the conditional expectation of $\theta$, given $x$. Here, we are asking the question: what is a very good estimate of $\theta$ once $x$ is observed, or what is a good estimate of $\theta$ in the presence of a preassigned value of $x$? Then from Lemma 11.2 we have the answer that the "best" estimator, best in the minimum mean square sense, is the conditional expectation of $\theta$, given $x$. This is the Bayes' estimate, given by

$$E(\theta|x) = \int_{\theta} \theta g(\theta|x)\, d\theta \quad \text{if } \theta \text{ is continuous}$$

$$\phantom{E(\theta|x)} = \sum_{\theta} \theta g(\theta|x) \quad \text{if } \theta \text{ is discrete.} \tag{11.34}$$


Example 11.16. An experiment started with $n = 20$ rabbits. But rabbits die out one by one before the experiment is completed. Let $x$ be the number that survived and let $p$ be the true probability of survival for each rabbit. Then $x$ is a binomial random variable with parameters $(p, n = 20)$. Suppose that in this particular experiment 15 rabbits survived at the completion of the experiment. This $p$ need not be the same for all experimental rabbits. Suppose that $p$ has a type-1 beta density with parameters $(\alpha = 3, \beta = 5)$. Compute the Bayes' estimate of $p$ in the light of the observation $x = 15$.


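Solution 11.16. According to our notation,

$$f(x|p) = \binom{n}{x}p^{x}(1 - p)^{n-x}, \quad x = 0, 1, \ldots, n,\ 0 < p < 1$$

and

$$g(p) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, p^{\alpha - 1}(1 - p)^{\beta - 1}, \quad 0 < p < 1,\ \alpha > 0,\ \beta > 0.$$

The joint probability function is

$$f(x, p) = f(x|p)g(p) = \binom{n}{x}\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, p^{\alpha + x - 1}(1 - p)^{\beta + n - x - 1}, \quad x = 0, 1, \ldots, n,\ 0 < p < 1,\ \alpha > 0,\ \beta > 0.$$

The unconditional probability function of $x$ is given by

$$f_1(x) = \int_{0}^{1} f(x, p)\, dp = \binom{n}{x}\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\int_{0}^{1} p^{\alpha + x - 1}(1 - p)^{\beta + n - x - 1}\, dp = \binom{n}{x}\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\frac{\Gamma(\alpha + x)\Gamma(\beta + n - x)}{\Gamma(\alpha + \beta + n)}.$$

Therefore,

$$g(p|x) = \frac{f(x, p)}{f_1(x)} = \frac{\Gamma(\alpha + \beta + n)}{\Gamma(\alpha + x)\Gamma(\beta + n - x)}\, p^{\alpha + x - 1}(1 - p)^{\beta + n - x - 1}, \quad 0 < p < 1.$$

Hence

$$E(p|x) = \int_{0}^{1} p\, g(p|x)\, dp = \frac{\Gamma(\alpha + \beta + n)}{\Gamma(\alpha + x)\Gamma(\beta + n - x)}\int_{0}^{1} p^{(\alpha + x + 1) - 1}(1 - p)^{\beta + n - x - 1}\, dp$$

$$= \frac{\Gamma(\alpha + \beta + n)}{\Gamma(\alpha + x)\Gamma(\beta + n - x)}\,\frac{\Gamma(\alpha + x + 1)\Gamma(\beta + n - x)}{\Gamma(\alpha + \beta + n + 1)} = \frac{\alpha + x}{\alpha + \beta + n}.$$

This is the Bayes' estimator as a function of $x$. Since we have $\alpha = 3$, $\beta = 5$, $n = 20$ and $x$ is observed as 15, the Bayes' estimate of $p$, denoted by $E[p|x = 15]$, is given by

$$E[p|x = 15] = \frac{3 + 15}{3 + 5 + 20} = \frac{9}{14}.$$

The moment estimator and the maximum likelihood estimator of $p$ is $\hat{p} = \frac{x}{n}$ and the corresponding estimate is $\hat{p} = \frac{15}{20} = \frac{3}{4}$. The Bayes' estimate here is slightly reduced from $\frac{3}{4}$ to $\frac{9}{14}$ and the unconditional expected value of $p$ is

$$\frac{\alpha}{\alpha + \beta} = \frac{3}{3 + 5} = \frac{3}{8}.$$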

340 —— 11 Estimation 


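Example 11.17. Let the conditional density of $x$ given $\theta$ be a gamma density of the type

$$f(x|\theta) = \frac{\theta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha - 1}e^{-\theta x}, \quad x \ge 0,\ \alpha > 0,\ \theta > 0$$

and the prior density of $\theta$ be again a gamma of the form

$$g(\theta) = \frac{\gamma^{\beta}}{\Gamma(\beta)}\, \theta^{\beta - 1}e^{-\gamma\theta}, \quad \gamma > 0,\ \theta > 0,\ \beta > 0.$$

Compute the unconditional density of $x$.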


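Solution 11.17. The joint density of $x$ and $\theta$ is given by

$$f(x, \theta) = \frac{\gamma^{\beta}x^{\alpha - 1}}{\Gamma(\alpha)\Gamma(\beta)}\, \theta^{\alpha + \beta - 1}e^{-\theta(x + \gamma)}$$

for $\theta > 0$, $x \ge 0$, $\gamma > 0$, $\alpha > 0$, $\beta > 0$. The unconditional density of $x$ is given by

$$f_1(x) = \int_{0}^{\infty} f(x, \theta)\, d\theta = \frac{\gamma^{\beta}x^{\alpha - 1}}{\Gamma(\alpha)\Gamma(\beta)}\int_{0}^{\infty} \theta^{\alpha + \beta - 1}e^{-\theta(x + \gamma)}\, d\theta$$

$$= \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \gamma^{\beta}x^{\alpha - 1}(x + \gamma)^{-(\alpha + \beta)}$$

$$= \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)\gamma^{\alpha}}\, x^{\alpha - 1}\left(1 + \frac{x}{\gamma}\right)^{-(\alpha + \beta)}$$

for $x \ge 0$, $\gamma > 0$, $\alpha > 0$, $\beta > 0$ and zero elsewhere.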
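
The closed form for $f_1(x)$ can be compared with a direct numerical integration of the joint density over $\theta$. This is an editorial sketch under stated assumptions: the parameter values are hypothetical, the function names are invented, and the infinite upper limit is truncated at a point where the integrand is negligible for these values.

```python
import math


def joint_density(x, theta, alpha, beta, gamma):
    """f(x, theta) = gamma^beta x^(alpha-1) theta^(alpha+beta-1) e^(-theta(x+gamma)) / (G(alpha) G(beta))."""
    return (gamma ** beta * x ** (alpha - 1) * theta ** (alpha + beta - 1)
            * math.exp(-theta * (x + gamma)) / (math.gamma(alpha) * math.gamma(beta)))


def unconditional_closed_form(x, alpha, beta, gamma):
    """Last expression for f_1(x) in Solution 11.17."""
    c = math.gamma(alpha + beta) / (math.gamma(alpha) * math.gamma(beta) * gamma ** alpha)
    return c * x ** (alpha - 1) * (1.0 + x / gamma) ** (-(alpha + beta))


def unconditional_numeric(x, alpha, beta, gamma, upper=40.0, steps=40000):
    """Midpoint rule over theta in (0, upper); the upper limit is a truncation assumption."""
    h = upper / steps
    return h * sum(joint_density(x, (i + 0.5) * h, alpha, beta, gamma) for i in range(steps))


alpha, beta, gamma = 2.0, 3.0, 1.5     # hypothetical parameter values
for x in (0.5, 1.0, 2.0):
    print(x, unconditional_closed_form(x, alpha, beta, gamma),
          unconditional_numeric(x, alpha, beta, gamma))
```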


Note 11.10. The last expression for $f_1(x)$ above is also known as superstatistics in physics. For $\alpha = 1$, $\gamma = \frac{1}{q - 1}$, $\beta + 1 = \frac{1}{q - 1}$ it is Tsallis statistics, for $q > 1$, in physics in the area of non-extensive statistical mechanics. There are many applications in various areas of physics and engineering.


11.4.2 Estimation in the conditional space: model building 


Suppose that a farmer is watching the growth of his nutmeg tree from the time of germination, growth being measured in terms of its height $h$ and time $t$ being measured in units of weeks. At $t = 0$, the seed germinated and the height is $h = 0$. When $t = 1$, after one week, let the height be 10 cm. Then at $t = 1$, $h = 10$, height being measured in centimeters. The question is whether we can predict or estimate the height $h$ by observing $t$, so that we will be able to give a "good" estimated value of $h$ at a preassigned value such as $t = 100$, that is, after 100 weeks what will be the height $h$? Let $g(t)$ be an arbitrary function of $t$ which is used to predict or estimate $h$. Then $E|h - g(t)|^2$ is the square of a distance between $h$ and $g(t)$, $E$ denoting the expected value. Suppose that this distance is minimized over all possible functions $g$ and we then come up with that $g$ for which this distance is the minimum. Such an estimator can be called a good estimator of $h$. That is, $\min_{g} E|h - g(t)|^2 \Rightarrow g = ?$ at a preassigned value of $t$. This is already available from Lemma 11.2 and the answer is that $g(t) = E[h|t]$, or it is the conditional expectation of $h$, given $t$, which is the "best" estimator of $h$, best in the sense of minimizing the expected squared error. This conditional expectation or the function $g$ can be constructed in the following situations: (1) the joint density of $h$ and $t$ is available, (2) the joint density of $h$ and $t$ is not available but the conditional density of $h$, given $t$, is available. In general, the problems of this type are to estimate or predict a variable $y$ at a preassigned value of $x$ where $x$ may contain one or more variables. The best estimator or the best predictor, best in the minimum mean square sense, is the conditional expectation of $y$ given $x$.


Example 11.18. Construct the best estimator of y at x = 1/3 if x and y have the following
joint density:

$$
f(x,y) = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}(y-2-3x)^2}, \quad -\infty<y<\infty,\ 0\le x\le 1
$$

and zero elsewhere.


Solution 11.18. Here, y can be easily integrated out.

$$
\int_{-\infty}^{\infty} f(x,y)\,dy = \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}(y-2-3x)^2}\,dy = 1
$$

from the total probability of a normal density with μ = 2 + 3x and σ² = 1. Hence the
marginal density of x is uniform over [0,1]. That is,

$$
f_1(x) = \begin{cases} 1, & 0\le x\le 1\\ 0, & \text{elsewhere.}\end{cases}
$$

Then, naturally, the conditional density of y given x is normal N(μ = 2 + 3x, σ² = 1).
Therefore, the conditional expectation of y, given x, is E[y|x] = 2 + 3x, which is the
best predictor or estimator of y at any preassigned value of x. Hence the best predicted
value or the best estimated value of y at x = 1/3 is 2 + 3(1/3) = 3. Problems of this type will
be taken up later in the chapter on regression problems and hence this method will
not be elaborated here.
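
The minimum mean square property can be illustrated by simulation. The sketch below draws pairs (x, y) from the joint density of Example 11.18 and compares the average squared error of the conditional expectation 2 + 3x with that of another linear predictor. The competing predictor 2.5 + 2x, the seed and the sample size are arbitrary assumptions made only for the comparison.

```python
import random

def best_predictor(x):
    """Conditional expectation E[y|x] = 2 + 3x from Solution 11.18."""
    return 2.0 + 3.0 * x

def other_predictor(x):
    """An arbitrary competing predictor (hypothetical, for comparison only)."""
    return 2.5 + 2.0 * x

def mean_squared_error(predictor, pairs):
    return sum((y - predictor(x)) ** 2 for x, y in pairs) / len(pairs)

rng = random.Random(0)
pairs = []
for _ in range(20000):
    x = rng.random()                     # x is uniform over [0, 1)
    y = rng.gauss(2.0 + 3.0 * x, 1.0)    # y given x is N(2 + 3x, 1)
    pairs.append((x, y))

mse_best = mean_squared_error(best_predictor, pairs)
mse_other = mean_squared_error(other_predictor, pairs)
print(mse_best, mse_other, best_predictor(1.0 / 3.0))
```

The simulated error of the conditional expectation should be close to the conditional variance 1, and smaller than that of the competing predictor.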


In Sections 11.3.1 to 11.3.4, we examined point estimation procedures of estimating
a parameter or parametric function, a fixed quantity, in a density/probability function
of x by taking observations on x. Then in Sections 11.4.1 and 11.4.2, we examined two
situations of estimating one variable by observing or preassigning another variable
or other variables. Now we will examine how to estimate a density function itself, after
examining some more properties of estimators.


11.4.3 Some properties of estimators 


Some interesting properties can be obtained in the conditional space, connecting the
properties of unbiasedness and sufficiency of estimators. Let $x_1,\ldots,x_n$ be iid from the
population designated by the density/probability function f(x, θ). Let g(θ) be a
function of θ. Let $u = u(x_1,\ldots,x_n)$ be an unbiased estimator of g(θ). Let $T = T(x_1,\ldots,x_n)$ be
a sufficient statistic for θ. Let the conditional expectation of u given T be denoted by
h(T), that is, h(T) = E[u|T]. From the unbiasedness, we have E[u] = g(θ). Now, going
to the conditional space with the help of Lemma 11.3 we have

$$
g(\theta) = E[u] = E[E(u|T)] = E[h(T)] \;\Rightarrow\; E[h(T)] = g(\theta) \tag{11.35}
$$


or, h(T) is also unbiased for g(θ). Thus we see that if there exists a sufficient statistic
T for θ and if there exists an unbiased estimator u for a function of θ, namely, g(θ),
then the conditional expectation of u for given values of the sufficient statistic T is
also unbiased for g(θ). Now, we will obtain an interesting result on the variance of
any unbiased estimator for g(θ) and the variance of h(T), a function of a sufficient
statistic for θ. From Lemma 11.4, we have

$$
\mathrm{Var}(u) = E[u - g(\theta)]^2 = \mathrm{Var}(E[u|T]) + E[\mathrm{Var}(u|T)]
= \mathrm{Var}(h(T)) + \delta, \quad \delta \ge 0
$$


where δ is the expected value of a variance, and the variance of a real random variable,
whether in the conditional space or in the unconditional space, is always
non-negative. Hence what we have established is that the variance of an unbiased
estimator for g(θ), if there exists an unbiased estimator, is greater than or equal to the
variance of a function of a sufficient statistic, if there exists a sufficient statistic for θ.
That is,

$$
\mathrm{Var}(u) \ge \mathrm{Var}(h(T)) \tag{11.36}
$$


where h(T) = E[u|T] is the conditional expectation of u given T, which is a function
of a sufficient statistic for θ. The result in (11.36) is known as the Rao-Blackwell theorem,
named after C.R. Rao and David Blackwell who derived the inequality first. The beauty
of the result is that if we are looking for the minimum variance bound for unbiased
estimators then we need to look only in the class of functions of sufficient statistics, if
there exists a sufficient statistic for the parameter θ. We have seen earlier that if one
sufficient statistic T exists for a parameter θ then there exist many sufficient statistics
for the same θ. Let $T_1$ and $T_2$ be two sufficient statistics for θ. Let $E[u|T_i] = h_i(T_i)$, i = 1, 2.
Should we take $h_1(T_1)$ or $h_2(T_2)$ if we are trying to improve the estimator u, in the sense
of finding another unbiased estimator which has a smaller variance? Uniqueness for
h(T) cannot be achieved unless the statistic T satisfies one more condition of
completeness. Note that

$$
g(\theta) = E(u) = E[h_1(T_1)] = E[h_2(T_2)] \;\Rightarrow\; E[h_1(T_1) - h_2(T_2)] = 0. \tag{11.37}
$$


Definition 11.10 (Complete statistics). Let T be a statistic and let k(T) be an
arbitrary function of T. If E[k(T)] = 0 for all θ in the parameter space Ω implies that
k(T) = 0 with probability one then we say that T is a complete statistic for the
parameter θ.


Observe that completeness is a property of the density/probability function of T
and it tells more about the structure of the density/probability function. If T is a
sufficient and complete statistic for θ, then E[u|T] = h(T) is unique. Thus, in a practical
situation, if we try to improve an unbiased estimator u for a parametric function
g(θ) then we look for a complete sufficient statistic T. If there exists such a T, then we take
h(T) = E[u|T], which will give an improved estimator in the sense of having a smaller
variance compared to the variance of u, and h(T) is unique here also.


Example 11.19. Let $x_1, x_2$ be iid as exponential with parameter θ. That is, with the
density

$$
f(x) = \frac{1}{\theta}\,e^{-x/\theta}, \quad \theta>0,\ x\ge 0
$$

and zero elsewhere. Let $u_1 = 0.6x_1 + 0.4x_2$, $u_2 = x_1 + x_2$. Then we know that $u_1$ is
unbiased for θ and $u_2$ is a sufficient statistic for θ. Construct $h(u_2) = E[u_1|u_2]$ and show that
it has smaller variance compared to the variance of $u_1$.


Solution 11.19. Let us transform $x_1, x_2$ to $u_1, u_2$. Then

$$
u_1 = 0.6x_1 + 0.4x_2 \ \text{ and } \ u_2 = x_1 + x_2 \;\Rightarrow\;
x_1 = -2u_2 + 5u_1 \ \text{ and } \ x_2 = 3u_2 - 5u_1
$$

and the Jacobian is 5. Let the joint densities of $x_1, x_2$ and $u_1, u_2$ be denoted by $f(x_1, x_2)$
and $g(u_1, u_2)$, respectively. Then

$$
f(x_1,x_2) = \frac{1}{\theta^2}\,e^{-(x_1+x_2)/\theta}, \quad \theta>0,\ x_1\ge 0,\ x_2\ge 0
$$

and zero elsewhere, and

$$
g(u_1,u_2) = \frac{5}{\theta^2}\,e^{-u_2/\theta}
$$

and zero elsewhere, where $\frac{5}{3}u_1 < u_2 < \frac{5}{2}u_1$ and $0 < u_1 < \infty$, or $\frac{2}{5}u_2 < u_1 < \frac{3}{5}u_2$ and
$0 < u_2 < \infty$. Since $x_1$ and $x_2$ are iid exponential, the sum $u_2$ is a gamma with parameters
(α = 2, β = θ) or the density function of $u_2$, denoted by $f_2(u_2)$, is given by

$$
f_2(u_2) = \frac{u_2}{\theta^2}\,e^{-u_2/\theta}, \quad u_2\ge 0,\ \theta>0
$$

and zero elsewhere. Hence the conditional density of $u_1$, given $u_2$, denoted by $g(u_1|u_2)$,
is available as

$$
g(u_1|u_2) = \frac{g(u_1,u_2)}{f_2(u_2)} = \frac{5}{u_2}, \quad \frac{2}{5}u_2 < u_1 < \frac{3}{5}u_2.
$$


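Hence the conditional expectation of $u_1$, given $u_2$, is the following:

$$
E[u_1|u_2] = \frac{5}{u_2}\int_{\frac{2}{5}u_2}^{\frac{3}{5}u_2} u_1\,du_1
= \frac{5}{2u_2}\Big[\frac{9}{25}u_2^2 - \frac{4}{25}u_2^2\Big] = \frac{u_2}{2}.
$$

Denoting this conditional expectation as $h(u_2) = u_2/2$ and treating it as a function of the
random variable $u_2$, we have the variance of $h(u_2)$ as follows:

$$
\mathrm{Var}(h(u_2)) = \frac{1}{4}\mathrm{Var}(x_1 + x_2) = \frac{1}{4}(\theta^2+\theta^2) = 0.5\theta^2.
$$

But

$$
\mathrm{Var}(u_1) = \mathrm{Var}(0.6x_1 + 0.4x_2) = (0.6)^2\theta^2 + (0.4)^2\theta^2 = 0.52\theta^2.
$$

This shows that $h(u_2)$ has a smaller variance compared to the variance of the unbiased
estimator $u_1$. This illustrates the Rao-Blackwell theorem.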
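
The variance comparison in Example 11.19 can also be seen in a simulation. The following sketch assumes θ = 2 (an arbitrary choice), draws many pairs of exponential observations, and compares the sample variances of $u_1 = 0.6x_1 + 0.4x_2$ and $h(u_2) = (x_1 + x_2)/2$; the theoretical values are 0.52θ² and 0.5θ².

```python
import random

def sample_mean(values):
    return sum(values) / len(values)

def sample_variance(values):
    mean = sample_mean(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

rng = random.Random(0)
theta = 2.0          # hypothetical true parameter value
reps = 100000

u1_values, h_values = [], []
for _ in range(reps):
    x1 = rng.expovariate(1.0 / theta)   # exponential with mean theta
    x2 = rng.expovariate(1.0 / theta)
    u1_values.append(0.6 * x1 + 0.4 * x2)   # original unbiased estimator
    h_values.append((x1 + x2) / 2.0)        # E[u1 | u2] = u2 / 2

mean_u1, mean_h = sample_mean(u1_values), sample_mean(h_values)
var_u1, var_h = sample_variance(u1_values), sample_variance(h_values)
print(mean_u1, mean_h, var_u1, var_h)
```

Both estimators have sample mean near θ, while the conditioned estimator shows the smaller sample variance.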


11.4.4 Some large sample properties of maximum likelihood estimators 


Here, we will examine a few results which will show that the maximum likelihood
estimator of a parameter θ possesses some interesting large sample (as the sample size
becomes larger and larger) properties. Rigorous proofs of these results are beyond the
scope of this book. We will give an outline of the derivations. In the following
derivations, we will be using differentiability of the density/probability function, f(x, θ),
differentiation with respect to a parameter θ inside the integrals or summations, etc.,
and hence the procedures are not applicable when the support of f(x, θ) (where f(x, θ)
is non-zero) depends on the parameter θ. These aspects should be kept in mind. The
regularity conditions in Result 11.5 must hold.


Let $x_1,\ldots,x_n$ be iid with density/probability function f(x, θ). The joint density/
probability function, denoted by $L_n(X,\theta)$, $X' = (x_1,\ldots,x_n)$, a prime denoting a
transpose, is given by

$$
L_n(X,\theta) = \prod_{j=1}^{n} f(x_j,\theta).
$$

Then

$$
\frac{\partial}{\partial\theta}\ln L_n(X,\theta) = 0 \tag{11.38}
$$


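is called the likelihood equation. Let a solution of the likelihood equation be denoted
by $\hat{\theta}$. Let the true value of the parameter be denoted by $\theta_0$. Then from equation (11.24),
we know that

$$
E\Big[\frac{\partial}{\partial\theta}\ln f(x_j,\theta)\Big|_{\theta=\theta_0}\Big] = 0. \tag{11.39}
$$

From (11.38), we have

$$
\frac{\partial}{\partial\theta}\ln L_n(X,\theta)\Big|_{\theta=\hat{\theta}} = 0 \;\Rightarrow\;
\sum_{j=1}^{n}\Big[\frac{\partial}{\partial\theta}\ln f(x_j,\theta)\Big|_{\theta=\hat{\theta}}\Big] = 0.
$$

But from the weak law of large numbers (see Chapter 9),

$$
\frac{1}{n}\sum_{j=1}^{n}\Big[\frac{\partial}{\partial\theta}\ln f(x_j,\theta)\Big|_{\theta=\hat{\theta}}\Big]
\to E\Big[\frac{\partial}{\partial\theta}\ln f(x_j,\theta)\Big|_{\theta=\theta_0}\Big]
$$

as n → ∞. But the right side expected value is already zero at the true parameter value
$\theta_0$ by equation (11.24). Hence as n → ∞, $\hat{\theta}$ goes to the true parameter value $\theta_0$ with
probability one, or

$$
\Pr\{|\hat{\theta} - \theta_0| < \epsilon\} \to 1 \ \text{ as } \ n \to \infty.
$$

This shows that $\hat{\theta}$, a solution of the likelihood equation (11.38), is a consistent
estimator of the true parameter value $\theta_0$.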


Result 11.6. Consider a density/probability function f(x, θ) where the support does
not depend on θ. If the regularity conditions of Result 11.5 hold, then the MLE for
the parameter θ is a consistent estimator for θ when the sample size n → ∞.


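This means that $\hat{\theta}$ is in the neighborhood of the true parameter value $\theta_0$. Let us
expand $0 = \frac{\partial}{\partial\theta}\ln L_n(X,\theta)\big|_{\theta=\hat{\theta}}$ in the neighborhood of the true parameter value $\theta_0$ to the
second-order terms. Then

$$
0 = \frac{\partial}{\partial\theta}\ln L_n(X,\theta)\Big|_{\theta=\theta_0}
+ (\hat{\theta}-\theta_0)\,\frac{\partial^2}{\partial\theta^2}\ln L_n(X,\theta)\Big|_{\theta=\theta_0}
+ \frac{(\hat{\theta}-\theta_0)^2}{2}\,\frac{\partial^3}{\partial\theta^3}\ln L_n(X,\theta)\Big|_{\theta=\theta_1} \tag{11.40}
$$

where $|\hat{\theta} - \theta_1| < |\hat{\theta} - \theta_0|$. From (11.40), by multiplying both sides by $\sqrt{n}$ and rearranging
terms we have the following:

$$
\sqrt{n}(\hat{\theta}-\theta_0) =
\frac{-\frac{1}{\sqrt{n}}\,\frac{\partial}{\partial\theta}\ln L_n(X,\theta)\big|_{\theta=\theta_0}}
{\frac{1}{n}\,\frac{\partial^2}{\partial\theta^2}\ln L_n(X,\theta)\big|_{\theta=\theta_0}
+ \frac{1}{n}\,\frac{(\hat{\theta}-\theta_0)}{2}\,\frac{\partial^3}{\partial\theta^3}\ln L_n(X,\theta)\big|_{\theta=\theta_1}} \tag{11.41}
$$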


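The second term in the denominator of (11.41) goes to zero because $\hat{\theta} \to \theta_0$ as n → ∞
and the third derivative of $\ln L_n(X,\theta)$ is assumed to be bounded. Then the first term in
the denominator is such that

$$
\frac{1}{n}\sum_{j=1}^{n}\frac{\partial^2}{\partial\theta^2}\ln f(x_j,\theta)\Big|_{\theta=\theta_0}
\to E\Big[\frac{\partial^2}{\partial\theta^2}\ln f(x_j,\theta)\Big]\Big|_{\theta=\theta_0}
= -\mathrm{Var}\Big[\frac{\partial}{\partial\theta}\ln f(x_j,\theta)\Big]\Big|_{\theta=\theta_0}
$$

by (11.30), where this variance is the information bound $I_1(\theta_0)$. Hence

$$
\frac{1}{n}\,\frac{\partial^2}{\partial\theta^2}\ln L_n(X,\theta)\Big|_{\theta=\theta_0} \to -I_1(\theta_0)
$$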


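where $I_1(\theta_0)$ is assumed to be positive. Hence we may rewrite (11.41) as follows:

$$
\sqrt{I_1(\theta_0)}\,\sqrt{n}(\hat{\theta}-\theta_0) \approx
\frac{1}{\sqrt{n}\sqrt{I_1(\theta_0)}}\sum_{j=1}^{n}\frac{\partial}{\partial\theta}\ln f(x_j,\theta)\Big|_{\theta=\theta_0}
$$

where $\frac{\partial}{\partial\theta}\ln f(x_j,\theta)$ has expected value zero and variance $I_1(\theta_0)$. Further, $f(x_j,\theta)$ for
j = 1,...,n are iid variables. Hence by the central limit theorem

$$
\frac{1}{\sqrt{n}\sqrt{I_1(\theta_0)}}\sum_{j=1}^{n}\frac{\partial}{\partial\theta}\ln f(x_j,\theta)\Big|_{\theta=\theta_0} \to N(0,1) \ \text{ as } \ n \to \infty
$$

where N(0, 1) is the standard normal, or we may write

$$
\frac{1}{\sqrt{n}}\,\frac{\partial}{\partial\theta}\ln L_n(X,\theta)\Big|_{\theta=\theta_0} \to N(0, I_1(\theta_0))
$$

which shows that the left side

$$
\sqrt{I_1(\theta_0)}\,\sqrt{n}(\hat{\theta}-\theta_0) \to N(0,1). \tag{11.41a}
$$

Since $I_1(\theta_0)$ is free of n, this is also the same as saying

$$
\sqrt{n}(\hat{\theta}-\theta_0) \to N\Big(0, \frac{1}{I_1(\theta_0)}\Big) \tag{11.41b}
$$

which also shows that $\sqrt{n}\,\hat{\theta}$ attains its minimum variance bound as n → ∞ or $\hat{\theta}$ is
relatively most efficient for $\theta_0$ when n → ∞. Thus, we have the following result.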




Result 11.7. When the regularity conditions of Result 11.5 hold, the MLE $\hat{\theta}$ of the true
parameter value $\theta_0$ is at least asymptotically (as n → ∞) the most efficient and

$$
\sqrt{n}(\hat{\theta}-\theta_0) \to N\Big(0, \frac{1}{I_1(\theta_0)}\Big) \ \text{ as } \ n \to \infty.
$$


Note 11.11. Equations (11.41a) and (11.41b) are very often misinterpreted in
statistical literature. Hence the student must be very careful in using and interpreting
(11.41a) and (11.41b). Misinterpretation comes from assuming that $\hat{\theta}$ is
approximately normal for large values of the sample size n, in the light of (11.41b) or
(11.41a), which is incorrect. When n becomes larger and larger the density/probability
function of $\hat{\theta}$ may come closer and closer to a degenerate density and not to
an approximate normal density. For each n, as well as when n → ∞, $\hat{\theta}$ may have its
own distribution. For example, if θ is the mean value in the exponential population
then $\hat{\theta} = \bar{x}$, the sample mean, but $\bar{x}$ has a gamma distribution for all n, and not an
approximate normal distribution. A certain weighted and relocated $\hat{\theta}$, as shown in
(11.41a) and (11.41b), has an approximate normal distribution as n becomes larger and
larger, and finally when n → ∞ a normal distribution.


Example 11.20. Illustrate the properties of the maximum likelihood estimator of θ in
the exponential population with density

$$
f(x) = \frac{1}{\theta}\,e^{-x/\theta}, \quad x\ge 0,\ \theta>0
$$

and zero elsewhere.


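Solution 11.20. Let $x_1,\ldots,x_n$ be iid with the density as above. Then the joint density
of $X' = (x_1,\ldots,x_n)$, prime denoting the transpose, is given by

$$
L_n(X,\theta) = \frac{1}{\theta^n}\exp\Big[-\frac{1}{\theta}(x_1+\cdots+x_n)\Big]
$$

and

$$
\ln L_n(X,\theta) = -n\ln\theta - \frac{1}{\theta}(x_1+\cdots+x_n)
$$

$$
\frac{\partial}{\partial\theta}\ln L_n(X,\theta) = 0 \;\Rightarrow\; -\frac{n}{\theta} + \frac{1}{\theta^2}(x_1+\cdots+x_n) = 0
\;\Rightarrow\; \hat{\theta} = \bar{x}.
$$

Thus, $\hat{\theta} = \bar{x}$ is a solution of the likelihood equation. We know from the exponential population
that $E(\bar{x}) = \theta_0$ and $\mathrm{Var}(\bar{x}) = \theta_0^2/n$, where $\theta_0$ is the true parameter value, for all n. Hence
$\sqrt{n}(\hat{\theta} - \theta_0) = \sqrt{n}(\bar{x} - \theta_0)$. But by the central limit theorem

$$
\frac{\bar{x}-\theta_0}{\sqrt{\mathrm{Var}(\bar{x})}} = \frac{\sqrt{n}(\bar{x}-\theta_0)}{\theta_0} \to N(0,1) \ \text{ as } \ n \to \infty. \tag{a}
$$

This also shows that $\sqrt{n}(\bar{x} - \theta_0) \to N(0,\theta_0^2)$ as n → ∞ since $\theta_0^2$ is free of n. Now,

$$
\frac{\partial^2}{\partial\theta^2}\ln f(x_j,\theta) = \frac{\partial}{\partial\theta}\Big(-\frac{1}{\theta} + \frac{x_j}{\theta^2}\Big)
= \frac{1}{\theta^2} - \frac{2x_j}{\theta^3}.
$$

Hence

$$
-E\Big[\frac{\partial^2}{\partial\theta^2}\ln f(x_j,\theta)\Big] = -\frac{1}{\theta^2} + \frac{2E(x_j)}{\theta^3}
= \frac{1}{\theta^2} = I_1(\theta). \tag{b}
$$

Therefore, we may write the result in (a) as

$$
\sqrt{n}(\hat{\theta}-\theta_0) \to N\Big(0, \frac{1}{I_1(\theta_0)}\Big) \ \text{ as } \ n \to \infty \tag{c}
$$

which illustrates the result on asymptotic (as n → ∞) efficiency and normality of the
MLE of θ here.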
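
A small simulation can make results (a) to (c) concrete. The sketch below assumes a true value θ₀ = 2 and sample size n = 200 (both arbitrary), repeatedly computes the MLE $\hat{\theta} = \bar{x}$, and looks at the standardized quantity $\sqrt{I_1(\theta_0)}\sqrt{n}(\hat{\theta} - \theta_0)$, whose mean and variance should be near 0 and 1. It also evaluates the MLE for one large sample to illustrate consistency.

```python
import math
import random

def mle_exponential(sample):
    """MLE of the mean parameter theta in the exponential density: the sample mean."""
    return sum(sample) / len(sample)

def standardized_mle(theta0, n, rng):
    """sqrt(I_1(theta0)) * sqrt(n) * (theta_hat - theta0), with I_1(theta) = 1/theta^2."""
    sample = [rng.expovariate(1.0 / theta0) for _ in range(n)]
    theta_hat = mle_exponential(sample)
    information = 1.0 / theta0 ** 2
    return math.sqrt(information) * math.sqrt(n) * (theta_hat - theta0)

rng = random.Random(1)
theta0, n, reps = 2.0, 200, 2000   # hypothetical settings for the illustration

z = [standardized_mle(theta0, n, rng) for _ in range(reps)]
z_mean = sum(z) / reps
z_var = sum((v - z_mean) ** 2 for v in z) / (reps - 1)

# Consistency: with a large sample the MLE should be close to theta0.
theta_hat_large = mle_exponential([rng.expovariate(1.0 / theta0) for _ in range(50000)])

print(z_mean, z_var, theta_hat_large)
```

In line with Note 11.11, it is the standardized quantity that is approximately standard normal; the estimator itself concentrates around θ₀ as n grows.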


Exercises 11.4 


11.4.1. Derive the Bayes' estimator of the parameter λ in a Poisson population if λ has a
prior (1) exponential distribution with known scale parameter, (2) gamma distribution
with known scale and shape parameters.


11.4.2. If the conditional density of x, given θ, is given by

$$
f(x|\theta) = c_1 x^{\alpha-1} e^{-\theta^{\delta} x^{\gamma}}
$$

for α > 0, θ > 0, γ > 0, δ > 0, x ≥ 0 and zero elsewhere and θ has a prior density of the
form

$$
g(\theta) = c_2\,\theta^{\epsilon-1} e^{-\eta\theta^{\delta}}
$$

for ε > 0, η > 0, δ > 0, θ > 0 and zero elsewhere, (1) evaluate the normalizing constants
$c_1$ and $c_2$, (2) evaluate the Bayes' estimate of θ if ε, η and δ are known.


11.4.3. Write down your answer in Exercise 11.4.2 for the following special cases:
(1) δ = 1; (2) δ = 1, α = 1; (3) δ = 1, α = 1, ε = 1; (4) γ = 1, δ = 1.


11.4.4. Derive the best estimator, best in the minimum mean square sense, of y at 
preassigned values of x, if the joint density of x and y is given by the following: (1) 




and zero elsewhere; (2) 


fy) = 


6x? 
243 


and zero elsewhere. 


11.4.5. Evaluate the best estimator of y at (1) x = h (2x- i in Exercise 11.4.4. 


11.4.6. Evaluate the best predictor or the best estimator of y at preassigned values of 
x if the conditional density of y, given x, is the following: 


$$
g(y|x) = \frac{1}{\sqrt{2\pi}}\exp\Big[-\frac{1}{2}\big(y - 2 - 3x - 5x^2\big)^2\Big], \quad -\infty<y<\infty
$$

and evaluate the best estimate of y when (1) x = 0; (2) x = 1.

11.4.7. Check for asymptotic (as n → ∞) unbiasedness, efficiency, normality and
consistency of the maximum likelihood estimator of the parameter (1) λ in a Poisson
population; (2) p in a Bernoulli population; (3) μ in N(μ, σ²) with σ² known; (4) σ² in N(μ, σ²)
where μ is known; (5) α in a type-1 beta with β = 1; (6) β in a type-1 beta when α = 1.
Assume that a simple random sample of size n is available in each case.


11.4.8. Check for the asymptotic normality of a relocated and re-scaled MLE of θ in
a uniform population over [0, θ], assuming that a simple random sample of size n is
available.


11.4.9. Is the MLE in Exercise 11.4.8 consistent for θ?


11.4.10. Verify (a) the Cramer-Rao inequality, (b) the Rao-Blackwell theorem with reference
to the MLE of the parameter (1) p in a Bernoulli population; (2) λ in a Poisson
population; (3) θ in an exponential population, by taking suitable sufficient statistics
whenever necessary. Assume that a simple random sample of size n is available.


11.5 Density estimation 


Here, we will consider a few situations where one can uniquely determine a den- 
sity/probability function from some known characteristics. When such characteristic 
properties are not available, then we will try to estimate the density from data points. 


11.5.1 Unique determination of the density/probability function 


Suppose that for a real positive continuous scalar random variable x the density is
unknown but its h-th moment is available as

$$
E(x^h) = C\,\frac{\Gamma(\alpha+h)}{\Gamma(\alpha+\beta+h)}, \quad \alpha>0,\ \beta>0 \tag{11.42}
$$

and C is such that when h = 0 the right side is one. If (11.42) is available for an
arbitrary h, including complex values of h, then we know that a type-1 beta random
variable has the h-th moment of the type in (11.42). We can identify the density of x as

$$
f(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,x^{\alpha-1}(1-x)^{\beta-1}, \quad 0\le x\le 1,\ \alpha>0,\ \beta>0
$$

and zero elsewhere. From the structure of the moment if one cannot see the density
right away, then one may go through the inverse Mellin transform formula

$$
f(x) = x^{-1}\,\frac{1}{2\pi i}\int_{c-i\infty}^{c+i\infty}[E(x^h)]\,x^{-h}\,dh, \quad i = \sqrt{-1} \tag{11.43}
$$

and c in the contour is such that c > −α. In general, if for some positive real scalar
random variable, $E(x^h)$ is available, for an arbitrary h, then we may go through the
formula in (11.43) to obtain the density f(x). The conditions under which f(x) is uniquely
determined are the conditions for the existence of the inverse Mellin transform. The
discussion of the conditions is beyond the scope of this book. (Details are available
in the book [2].) Hence a practical procedure is to search in the list of h-th moments
of known variables and identify the variable if the h-th moment is in the class of
moments known to you.
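
The identification of the type-1 beta density from (11.42) can be checked numerically. The sketch below compares the moment formula, with C = Γ(α + β)/Γ(α) so that the right side is one at h = 0, against a midpoint-rule integral of $x^h f(x)$ over (0, 1). The parameter values and the function names are illustrative assumptions.

```python
import math

def beta_density(x, alpha, beta):
    """Type-1 beta density on 0 <= x <= 1."""
    c = math.gamma(alpha + beta) / (math.gamma(alpha) * math.gamma(beta))
    return c * x ** (alpha - 1) * (1 - x) ** (beta - 1)

def moment_formula(h, alpha, beta):
    """Right side of (11.42) with C chosen so that the value at h = 0 is one."""
    c = math.gamma(alpha + beta) / math.gamma(alpha)
    return c * math.gamma(alpha + h) / math.gamma(alpha + beta + h)

def moment_numeric(h, alpha, beta, steps=20000):
    """Midpoint-rule approximation of the integral of x^h f(x) over (0, 1)."""
    width = 1.0 / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * width
        total += x ** h * beta_density(x, alpha, beta)
    return width * total

alpha, beta, h = 2.5, 3.0, 1.7   # hypothetical values; h need not be an integer
print(moment_formula(h, alpha, beta), moment_numeric(h, alpha, beta))
```

Agreement for non-integer h is a reminder that (11.42) is assumed to hold for arbitrary h, not only for integer moments.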

If the characteristic function of a real scalar random variable is available, then we
may go through the inverse Fourier transform and obtain the corresponding density
function. If the Laplace transform of a positive real scalar random variable, $M_x(-t)$
where $M_x(t)$ is the moment generating function (mgf), is available then we may go
through the inverse Laplace transform and obtain the density.

Instead of the transform of the density, such as Mellin transform (arbitrary mo- 
ments for positive random variables), Laplace transform (mgf with t replaced by -t 
for positive random variables), Fourier transform (characteristic function) or other 
transforms, suppose that some properties of the random variable are available. If 
such properties are unique properties of some specific random variables, then from 
the properties one can reach the random variable through mathematical techniques 
of integral equations, functional equations, differential equations, algebraic manip- 
ulations, etc. This area is known as characterizations of distributions. (An insight 
into this area is available from the book [11].) If the properties are not characteristic 
properties, still we can come up with a class of functions having those properties, and 
thus we can narrow down the set of functions where the underlying density function 
belongs. These are some of the procedures for uniquely determining the densities 
from known characteristics. 


11.5.2 Estimation of densities 


Suppose that such characteristics as described in Section 11.5.1 are not available but 
only an observed sample (a set of numbers) is available. Can we identify or at least 
estimate the underlying density function, if there is one? Observe that infinitely many 




distributions can give rise to the data at hand, and hence unique determination of the 
underlying density is not possible. This point should be kept in mind when looking at 
any method of density estimation from observations. 

One method is to take the sample distribution (cumulative relative frequencies) 
function as a representative of the population distribution function (cumulative prob- 
ability function) F(x). 


Definition 11.11 (Sample distribution function). Let $x_1,\ldots,x_n$ be the n
observations. Let

$$
S_n(x) = \frac{\text{number of observations less than or equal to } x}{n} \tag{11.44}
$$

for −∞ < x < ∞. Then $S_n(x)$ is called the sample distribution function or empirical
distribution function based on n observations.


Example 11.21. Construct the sample distribution function if the following is a set of
observations from some population: −3, 2, 1, 5.


Solution 11.21. For −∞ < x < −3, $S_n(x) = 0$ because there are no observations there.
$S_n(x) = \frac{1}{4}$ at x = −3 and it remains the same until x = 1. Then $S_n(x) = \frac{2}{4}$ at x = 1 and in
the interval 1 ≤ x < 2, and so on. That is,

$$
S_n(x) = \begin{cases}
0, & -\infty < x < -3\\
\frac{1}{4}, & -3 \le x < 1\\
\frac{2}{4}, & 1 \le x < 2\\
\frac{3}{4}, & 2 \le x < 5\\
1, & x \ge 5.
\end{cases}
$$

Note that it is a step function. [The student is asked to draw the graph to see that the
graph looks like steps.] We will examine some basic properties of this $S_n(x)$.
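
Definition 11.11 translates directly into code. The following sketch builds the sample distribution function for the observations of Example 11.21 and evaluates it at a few points, using exact fractions so that the step heights are easy to compare with the solution. The function name `sample_distribution_function` is a hypothetical helper.

```python
from fractions import Fraction

def sample_distribution_function(observations):
    """Return S_n, where S_n(x) = (number of observations <= x) / n."""
    n = len(observations)

    def S_n(x):
        count = sum(1 for obs in observations if obs <= x)
        return Fraction(count, n)

    return S_n

S = sample_distribution_function([-3, 2, 1, 5])
for point in (-4, -3, 0, 1, 1.5, 2, 4, 5, 6):
    print(point, S(point))
```

The printed values jump only at the observed points −3, 1, 2 and 5, which is the step behaviour described in the solution.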

Let $u = nS_n(x)$ = the number of observations less than or equal to x. Let p be the
true probability of finding an observation less than or equal to x. Then p = F(x) = the
population distribution function of the underlying population. Then u is distributed
as a binomial random variable with parameters (p = F(x), n). Then from the binomial
probability law

$$
E[nS_n(x)] = E(u) = np = nF(x) \;\Rightarrow\; E[S_n(x)] = p = F(x) \tag{11.45}
$$

and

$$
\mathrm{Var}(nS_n(x)) = \mathrm{Var}(u) = np(1-p) = nF(x)(1-F(x))
\;\Rightarrow\; \mathrm{Var}(S_n(x)) = \frac{F(x)(1-F(x))}{n}. \tag{11.46}
$$
Note that $\mathrm{Var}(S_n(x)) \to 0$ as n → ∞ and $E[S_n(x)] = F(x)$ for all x and n. From the weak
law of large numbers or from Chebyshev's inequality, we have stochastic convergence
of $S_n(x)$ to the true distribution function F(x) or

$$
\lim_{n\to\infty}\Pr\{|S_n(x) - F(x)| < \epsilon\} = 1
$$

for ε > 0, however small it may be. Thus, we can say that $S_n(x)$ is a representative of the
true distribution function F(x). Then, when F(x) is differentiable, we have the density
f(x), given by

$$
f(x) = \lim_{\delta\to 0}\frac{F(x+\delta) - F(x)}{\delta}
$$

and hence we may make the approximation

$$
f(x) \approx \frac{F(x+h) - F(x-h)}{2h}.
$$

Hence let

$$
f_n^*(x) = \frac{S_n(x+h_n) - S_n(x-h_n)}{2h_n} \tag{11.47}
$$

where $h_n$ is any positive sequence of real numbers converging to zero. This $f_n^*(x)$ can
be taken as an estimate of the true density f(x).
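
A minimal sketch of the estimate in (11.47) follows. It assumes a simulated sample from the exponential density with mean 1 and the sequence $h_n = n^{-1/5}$; both choices are illustrative assumptions, since the text only requires that $h_n$ be positive and converge to zero.

```python
import bisect
import math
import random

def sample_distribution_function(observations):
    """Return S_n, where S_n(x) = (number of observations <= x) / n."""
    ordered = sorted(observations)
    n = len(ordered)

    def S_n(x):
        # bisect_right counts the observations less than or equal to x.
        return bisect.bisect_right(ordered, x) / n

    return S_n

def density_estimate(observations, x, h_n):
    """Estimate of f(x) as in (11.47): [S_n(x + h_n) - S_n(x - h_n)] / (2 h_n)."""
    S_n = sample_distribution_function(observations)
    return (S_n(x + h_n) - S_n(x - h_n)) / (2.0 * h_n)

rng = random.Random(3)
n = 20000
data = [rng.expovariate(1.0) for _ in range(n)]   # hypothetical sample
h_n = n ** (-0.2)                                 # one possible choice of h_n

estimate = density_estimate(data, 1.0, h_n)
true_value = math.exp(-1.0)                       # true density at x = 1
print(estimate, true_value)
```

For the four observations of Example 11.21, `density_estimate([-3, 2, 1, 5], 1.5, 1.0)` gives (3/4 − 1/4)/2 = 0.25, which shows how coarse the estimate is for very small samples.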


Exercises 11.5 


11.5.1. Determine the density of the non-negative random variable x where x has the
h-th moment, for arbitrary h, of the form:

$$
\text{(i)}\ \ E(x^h) = \frac{\Gamma(1+h)}{\Gamma(2+h)}; \qquad \text{(ii)}\ \ E(x^h) = \Gamma(1+h)\Gamma(1-h).
$$


11.5.2. Determine the density of x if the characteristic function is f(t) = et, 
11.5.3. Determine the density of a non-negative random variable x ifthe Laplace trans- 
form of the density f(x) is L(t) 2 (1 0-5, 1+t>0. 


11.5.4. Let f(x) be the density of a continuous real scalar random variable x. Then
Shannon's entropy is given by $S = -c\int_x f(x)\ln f(x)\,dx$, where c is a constant. By using
calculus of variation, or otherwise, determine that f for which S is maximum, subject
to the condition that $E(x) = \int_x x f(x)\,dx = d < \infty$.


11.5.5. Someone is throwing a dart at a target on a plane board. Let the point of hit
be (x, y) under a rectangular coordinate system on the board. Let the density function
of (x, y) be f(x, y). Let the Euclidean distance of the point of hit from the origin of the
rectangular coordinate system be $r = \sqrt{x^2 + y^2}$. Under the assumption that f(x, y) = g(r)
where g is some unknown function, and assuming that x and y are independently
distributed, derive the densities of x and y and show that x and y are identically normally
distributed.


12 Interval estimation 


12.1 Introduction 


In Chapter 11, we looked into point estimation in the sense of giving single values
or points as estimates for well-defined parameters in a pre-selected population
density/probability function. If p is the probability that someone contesting an election
will win and if we give an estimate as p = 0.7, then we are saying that there is exactly
a 70% chance of winning. From a layman's point of view, such an exact number may
not be that reasonable. If we say that the chance is between 60 and 75%, it may be
more acceptable to a layman. If the waiting time in a queue at a check-out counter
in a grocery store is exponentially distributed with expected waiting time θ minutes,
time being measured in minutes, and if we give an estimate of θ as between 5 and 10
minutes it may be more reasonable than giving a single number such as the expected
waiting time is exactly 6 minutes. If we give an estimate of the expected life-time of
individuals in a certain community of people as between 80 and 90 years, it may be
more acceptable than saying that the expected life-time is exactly 83 years. Thus,
when the unknown parameter θ has a continuous parameter space Ω it may be more
reasonable to come up with an interval so that we can say that the unknown parameter
θ is somewhere on this interval. We will examine such interval estimation problems
here.


12.2 Interval estimation problems 


In order to explain the various technical terms in this area, it is better to examine a 
simple problem and then define various terms appearing there, in the light of the il- 
lustrations. 


Example 12.1. Let $x_1,\ldots,x_n$ be iid variables from an exponential population with
density

$$
f(x,\theta) = \frac{1}{\theta}\,e^{-x/\theta}, \quad x\ge 0,\ \theta>0
$$

and zero elsewhere. Compute the densities of (1) $u = x_1 + \cdots + x_n$; (2) $v = u/\theta$ and then
evaluate a and b such that Pr{a ≤ v ≤ b} = 0.95.


Solution 12.1. The moment generating function (mgf) of x is known and it is
$M_x(t) = (1-\theta t)^{-1}$, 1 − θt > 0. Since $x_1,\ldots,x_n$ are iid, the mgf of $u = x_1 + \cdots + x_n$ is
$M_u(t) = (1-\theta t)^{-n}$, 1 − θt > 0, or u has a gamma distribution with parameters (α = n, β = θ).
The mgf of v is available from $M_u(t)$ as $M_v(t) = (1-t)^{-n}$, 1 − t > 0. In other words, v has a
gamma density with the parameters (α = n, β = 1) or it is free of all parameters since n is
known. Let the density of v be denoted by g(v). Then all sorts of probability statements
can be made on the variable v.

Open Access. © 2018 Arak M. Mathai, Hans J. Haubold, published by De Gruyter. This work is licensed
under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
https://doi.org/10.1515/9783110562545-012

Suppose that we wish to find an a such that Pr{v ≤ a} = 0.025. Then we have

$$
\int_0^a \frac{v^{n-1}}{\Gamma(n)}\,e^{-v}\,dv = 0.025.
$$


We can either integrate by parts or use incomplete gamma function tables to obtain
the exact value of a since n is known. Similarly, we can find a b such that

$$
\Pr\{v \ge b\} = 0.025 \;\Rightarrow\; \int_b^{\infty}\frac{v^{n-1}}{\Gamma(n)}\,e^{-v}\,dv = 0.025.
$$

This b is also available either by integrating by parts or from the incomplete gamma
function tables. Then the probability coverage over the interval [a, b] is 0.95 or

$$
\Pr\{a \le v \le b\} = 0.95.
$$


We are successful in finding a and b because the distribution of v is free of all
parameters. If the density of v contained some parameters, then we could not have found
a and b because those points would have been functions of the parameters involved.
Hence the success of our procedure depends upon finding a quantity such as v here,
which is a function of the sample values $x_1,\ldots,x_n$ and the parameter (or parameters)
under consideration, but whose distribution is free of all parameters. Such quantities
are called pivotal quantities.


Definition 12.1 (Pivotal quantities). A function of the sample values $x_1,\ldots,x_n$ and
the parameters under consideration but whose distribution is free of all parameters
is called a pivotal quantity.
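
In place of incomplete gamma function tables, the points a and b of Example 12.1 can be computed numerically. The sketch below assumes an integer n, for which repeated integration by parts gives Pr{v ≤ x} = 1 − e^{−x} Σ_{k<n} x^k/k!, and finds the quantiles by bisection. The function names and the choice n = 10 are assumptions made for illustration.

```python
import math

def gamma_cdf(v, n):
    """Pr{V <= v} for a gamma variable with integer shape n and scale 1,
    using the finite sum obtained from repeated integration by parts."""
    if v <= 0:
        return 0.0
    term, total = 1.0, 1.0          # k = 0 term of the sum over k < n
    for k in range(1, n):
        term *= v / k               # v^k / k!
        total += term
    return 1.0 - math.exp(-v) * total

def gamma_quantile(prob, n, tol=1e-10):
    """Point q with Pr{V <= q} = prob, found by bisection."""
    lo, hi = 0.0, 1.0
    while gamma_cdf(hi, n) < prob:  # expand until the root is bracketed
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gamma_cdf(mid, n) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

n = 10                              # hypothetical sample size
a = gamma_quantile(0.025, n)        # left tail area 0.025
b = gamma_quantile(0.975, n)        # right tail area 0.025
print(a, b, gamma_cdf(b, n) - gamma_cdf(a, n))
```

Because the distribution of the pivotal quantity v is free of θ, these two numbers depend only on n and can be computed before any data are observed.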


Let us examine Example 12.1 once again. We have a probability statement

$$
\Pr\{a \le v \le b\} = 0.95.
$$

Let us examine the mathematical inequalities here.

$$
a \le v \le b \;\Rightarrow\; a \le \frac{x_1+\cdots+x_n}{\theta} \le b
\;\Rightarrow\; \frac{1}{b} \le \frac{\theta}{x_1+\cdots+x_n} \le \frac{1}{a}
\;\Rightarrow\; \frac{x_1+\cdots+x_n}{b} \le \theta \le \frac{x_1+\cdots+x_n}{a}.
$$

Since these inequalities are mathematically identical, we must have the probability
statements over these intervals identical. That is,

$$
\Pr\Big\{a \le \frac{x_1+\cdots+x_n}{\theta} \le b\Big\}
= \Pr\Big\{\frac{x_1+\cdots+x_n}{b} \le \theta \le \frac{x_1+\cdots+x_n}{a}\Big\}. \tag{12.1}
$$

Thus, we have converted a probability statement over v into a probability statement
over θ. What is the difference between these two probability statements? The first one
says that the probability that the random variable falls on the fixed interval [a, b] is
0.95. In the second statement, θ is not a random variable but a fixed but unknown
parameter and the random variables are at the end points of the interval, or here the
interval is random, not θ. Hence the probability statement over θ is to be interpreted
as follows: the probability that the random interval $[\frac{u}{b}, \frac{u}{a}]$, with $u = x_1+\cdots+x_n$, covers the unknown θ is 0.95.

In this example, we have cut off 0.025 area at the right tail and 0.025 area at the
left tail so that the total area cut off is 0.025 + 0.025 = 0.05. If we had cut off an area
α/2 each at both the tails then the total area cut off is α and the area in the middle is
1 − α. In our Example 12.1, α = 0.05 and 1 − α = 0.95. We will introduce some standard
notations which will come in handy later on.


Notation 12.1. Let y be a random variable whose density f(y) is free of all
parameters. Then we can compute a point b such that from that point onward to the right
the area cut off is a specified number, say α. Then this b is usually denoted as $y_\alpha$,
or the value of y from there onward to the right the area under the density curve or
probability function is α or

$$
\Pr\{y \ge y_\alpha\} = \alpha. \tag{12.2}
$$

Then from Notation 12.1 if a is a point below which the left tail area is α then
the point a should be denoted as $y_{1-\alpha}$, or the point from where onward to the right the
area under the curve is 1 − α or the left tail area is α. In Example 12.1 if we wanted to
compute a and b so that equal areas α/2 are cut off at the right and left tails, then the first
part of equation (12.1) could have been written as

$$
\Pr\{v_{1-\frac{\alpha}{2}} \le v \le v_{\frac{\alpha}{2}}\} = 1-\alpha.
$$


Definition 12.2 (Confidence intervals). Let $x_1,\ldots,x_n$ be a sample from the
population f(x|θ) where θ is the parameter. Suppose that it is possible to construct two
functions of the sample values $\phi_1(x_1,\ldots,x_n)$ and $\phi_2(x_1,\ldots,x_n)$ so that the
probability that the random interval $[\phi_1, \phi_2]$ covers the unknown parameter θ is 1 − α for a
given α. That is,

$$
\Pr\{\phi_1(x_1,\ldots,x_n) \le \theta \le \phi_2(x_1,\ldots,x_n)\} = 1-\alpha
$$

for all θ in the parameter space Ω. Then 1 − α is called the confidence coefficient,
the interval $[\phi_1, \phi_2]$ is called a 100(1 − α)% confidence interval for θ, $\phi_1$ is called the
lower confidence limit, $\phi_2$ is called the upper confidence limit and $\phi_2 - \phi_1$ the length
of the confidence interval.


When a random interval $[\phi_1, \phi_2]$ is given we are placing 100(1 − α)% confidence on
our interval saying that this interval will cover the true parameter value θ with
probability 1 − α. The meaning is that if we construct the same interval by using samples of
the same size n then in the long run 100(1 − α)% of the intervals will contain the true
parameter θ. If one interval is constructed, then that interval need not contain the true
parameter θ; the chance that this interval contains the true parameter θ is 1 − α. In our
Example 12.1, we were placing 95% confidence in the interval
$\big[\frac{x_1+\cdots+x_n}{v_{0.025}}, \frac{x_1+\cdots+x_n}{v_{0.975}}\big]$ to
contain the unknown parameter θ.

From Example 12.1 and the discussions above, it is clear that we will be successful
in coming up with a $100(1-\alpha)\%$ confidence interval for a given parameter $\theta$ if we have
the following:

(i) A pivotal quantity $Q$, that is, a quantity containing the sample values and the
parameter $\theta$ but whose distribution is free of all parameters. [Note that there may be
many pivotal quantities in a given situation.]

(ii) $Q$ enables us to convert a probability statement on $Q$ into a mathematically
equivalent statement on $\theta$.




How many such $100(1-\alpha)\%$ confidence intervals can be constructed for a given $\theta$,
if one such interval can be constructed? The answer is: infinitely many. From our
Example 12.1, it is seen that instead of cutting off 0.025, or in general $\alpha/2$, at both ends we
could have cut off $\alpha$ at the right tail, or $\alpha$ at the left tail, or any $\alpha_1$ at the left tail and $\alpha_2$
at the right tail so that $\alpha_1+\alpha_2=\alpha$. In our example, cutting off all of $\alpha$ at the right tail,
that is, using $0 < v \le v_{\alpha}$, would have produced an interval of infinite length. Our aim is to
give an interval which covers the unknown $\theta$ with a given confidence coefficient $1-\alpha$, and a
statement that an interval of infinite length covers the unknown parameter may not have much
significance. Hence a very desirable property is that the expected length of the interval is
as short as possible.


Definition 12.3 (Central intervals). Confidence intervals, obtained by cutting off
equal areas $\alpha/2$ at both the tails of the distribution of the pivotal quantity so that we
obtain a $100(1-\alpha)\%$ confidence interval, are called central intervals.


It can be shown that if the pivotal quantity has a symmetric distribution then the 
central interval is usually the shortest in expected value. Observe also that when the 
length, which is the upper confidence limit minus the lower confidence limit, is taken, 
it may be free of all variables. In this case, the length and the expected length are one 
and the same. 


12.3 Confidence interval for parameters in an exponential 
population 


We have already given one example of setting up a confidence interval for the
parameter $\theta$ in the exponential population

$$f(x\mid\theta)=\frac{1}{\theta}\,e^{-x/\theta},\quad x\ge 0,\ \theta>0$$

and zero elsewhere. Our pivotal quantity was $u=(x_1+\dots+x_n)/\theta$, where $u$ has a gamma
distribution with the parameters $(\alpha=n,\ \beta=1)$ and $n$ is the sample size, which is known.
(Here the shape parameter $\alpha=n$ is not the $\alpha$ of the confidence coefficient.)
Hence there is no parameter and, therefore, probabilities can be read from incomplete
gamma tables or can be obtained by integration by parts. Then a $100(1-\alpha)\%$ confidence
interval for $\theta$ in an exponential population is given by

$$\left[\frac{x_1+\dots+x_n}{u_{\alpha/2}},\ \frac{x_1+\dots+x_n}{u_{1-\alpha/2}}\right]=\text{a }100(1-\alpha)\%\text{ confidence interval}$$

where

$$\int_0^{u_{1-\alpha/2}} g(u)\,du=\frac{\alpha}{2},\qquad \int_{u_{\alpha/2}}^{\infty} g(u)\,du=\frac{\alpha}{2} \tag{12.3}$$

and

$$g(u)=\frac{u^{n-1}e^{-u}}{\Gamma(n)},\quad u\ge 0.$$

Example 12.2. Construct a $100(1-\alpha)\%$ confidence interval for the location parameter
$\gamma$ in an exponential population, where the scale parameter $\theta$ is known, say $\theta=1$.
Assume that a simple random sample of size $n$ is available.


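Solution 12.2. The density function is given by

$$f(x\mid\gamma)=e^{-(x-\gamma)},\quad x\ge\gamma$$

and zero elsewhere. Let us consider the MLE of $\gamma$, which is the smallest order statistic
$x_{n:1}=y_1$. Then the density of $y_1$ is available as

$$g(y_1\mid\gamma)=-\frac{d}{dz}\big[\Pr\{x_j\ge z\}\big]^n\Big|_{z=y_1}=n\,e^{-n(y_1-\gamma)},\quad y_1\ge\gamma$$

and zero elsewhere. Let $u=y_1-\gamma$. Then $u$ has the density, denoted by $g_1(u)$, as follows:

$$g_1(u)=n\,e^{-nu},\quad u\ge 0$$

and zero elsewhere. Then we can read off $u_{\alpha/2}$ and $u_{1-\alpha/2}$ for any given $\alpha$ from this density.
That is,

$$\int_0^{u_{1-\alpha/2}} n\,e^{-nu}\,du=\frac{\alpha}{2}\ \Rightarrow\ 1-e^{-n u_{1-\alpha/2}}=\frac{\alpha}{2}\ \Rightarrow\ u_{1-\alpha/2}=-\frac{1}{n}\ln\Big(1-\frac{\alpha}{2}\Big) \tag{a}$$

$$\int_{u_{\alpha/2}}^{\infty} n\,e^{-nu}\,du=\frac{\alpha}{2}\ \Rightarrow\ e^{-n u_{\alpha/2}}=\frac{\alpha}{2}\ \Rightarrow\ u_{\alpha/2}=-\frac{1}{n}\ln\Big(\frac{\alpha}{2}\Big). \tag{b}$$

Now, we have the probability statement

$$\Pr\{u_{1-\alpha/2}\le y_1-\gamma\le u_{\alpha/2}\}=1-\alpha.$$

That is,

$$\Pr\{y_1-u_{\alpha/2}\le\gamma\le y_1-u_{1-\alpha/2}\}=1-\alpha.$$

Hence a $100(1-\alpha)\%$ confidence interval for $\gamma$ is given by

$$[\,y_1-u_{\alpha/2},\ y_1-u_{1-\alpha/2}\,]. \tag{12.4}$$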
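For example, for an observed sample 2, 8, 5 of size 3, a 95% confidence interval for
$\gamma$ is given by the following:

$$\alpha=0.05\ \Rightarrow\ \frac{\alpha}{2}=0.025,$$

$$u_{\alpha/2}=-\frac{1}{n}\ln\Big(\frac{\alpha}{2}\Big)=-\frac{1}{3}\ln(0.025),\qquad u_{1-\alpha/2}=-\frac{1}{3}\ln(0.975).$$

An observed value of $y_1$ is 2. Hence a 95% confidence interval for $\gamma$ is
$[\,2+\tfrac{1}{3}\ln(0.025),\ 2+\tfrac{1}{3}\ln(0.975)\,]$.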
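
As a numerical check, the interval (12.4) for the sample 2, 8, 5 can be evaluated directly, and the long-run interpretation of the confidence coefficient can be illustrated by simulation. This is only a sketch: the function names, the true value 1.5 used in the simulation, the number of trials and the seed are all arbitrary choices made for illustration.

```python
import random
from math import log

def location_interval(sample, alpha):
    """Interval (12.4) for the location parameter when the scale parameter is 1."""
    n = len(sample)
    y1 = min(sample)                        # smallest order statistic
    u_upper = -log(alpha / 2) / n           # u_{alpha/2}, from (b)
    u_lower = -log(1 - alpha / 2) / n       # u_{1-alpha/2}, from (a)
    return y1 - u_upper, y1 - u_lower

lower, upper = location_interval([2, 8, 5], 0.05)
print(round(lower, 4), round(upper, 4))     # roughly 0.77 and 1.99

def coverage(gamma_true, n, alpha, trials, seed=0):
    """Fraction of simulated samples whose interval contains gamma_true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [gamma_true + rng.expovariate(1.0) for _ in range(n)]
        lo, hi = location_interval(sample, alpha)
        hits += lo <= gamma_true <= hi
    return hits / trials

# The observed fraction should be close to 1 - alpha = 0.95.
print(coverage(gamma_true=1.5, n=3, alpha=0.05, trials=20000))
```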


Note 12.1. If both the scale parameter $\theta$ and the location parameter $\gamma$ are present, then we
need simultaneous confidence intervals or a confidence region for the point $(\theta,\gamma)$.
Confidence regions will be considered later.


Note 12.2. In Example 12.2, we have based the pivotal quantity on the smallest order
statistic $y_1=x_{n:1}$. We could have constructed a confidence interval by using a single
observation or the sum of the observations or the sample mean.


12.4 Confidence interval for the parameters in a uniform density 
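
Consider $x_1,\dots,x_n$ iid from a one-parameter uniform density

$$f(x\mid\theta)=\frac{1}{\theta},\quad 0\le x\le\theta$$

and zero elsewhere. Let us construct a $100(1-\alpha)\%$ confidence interval for $\theta$. Assume
that a simple random sample of size $n$ is available. The largest order statistic seems to
be a convenient starting point since it is the MLE of $\theta$. Let $y_n=x_{n:n}$ be the largest order
statistic. Then $y_n$ has the density

$$g(y_n\mid\theta)=\frac{d}{dz}\big[\Pr\{x_j\le z\}\big]^n\Big|_{z=y_n}=\frac{n}{\theta^n}\,y_n^{\,n-1},\quad 0\le y_n\le\theta.$$

Let us take the pivotal quantity as $u=y_n/\theta$. The density of $u$, denoted by $g_1(u)$, is given
by

$$g_1(u)=n\,u^{n-1},\quad 0\le u\le 1$$

and zero elsewhere. Hence

$$\int_0^{u_{1-\alpha/2}} n\,u^{n-1}\,du=\frac{\alpha}{2}\ \Rightarrow\ u_{1-\alpha/2}=\Big[\frac{\alpha}{2}\Big]^{1/n}$$

and

$$\int_{u_{\alpha/2}}^{1} n\,u^{n-1}\,du=\frac{\alpha}{2}\ \Rightarrow\ u_{\alpha/2}=\Big[1-\frac{\alpha}{2}\Big]^{1/n}.$$

Therefore,

$$\Pr\Big\{\Big[\frac{\alpha}{2}\Big]^{1/n}\le\frac{y_n}{\theta}\le\Big[1-\frac{\alpha}{2}\Big]^{1/n}\Big\}=1-\alpha.$$

Hence a $100(1-\alpha)\%$ confidence interval for $\theta$ in this case is

$$\left[\frac{y_n}{(1-\frac{\alpha}{2})^{1/n}},\ \frac{y_n}{(\frac{\alpha}{2})^{1/n}}\right]. \tag{12.5}$$

For example, for an observed sample 8, 2, 5 from this one-parameter uniform population
a 90% confidence interval for $\theta$ is given by
$\left[\dfrac{8}{(0.95)^{1/3}},\ \dfrac{8}{(0.05)^{1/3}}\right]$.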
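
The interval (12.5) is simple enough to evaluate directly. A short sketch for the sample 8, 2, 5 follows; the function name is hypothetical.

```python
def uniform_upper_interval(sample, alpha):
    """Interval (12.5) for theta in a uniform density over [0, theta]."""
    n = len(sample)
    y_n = max(sample)                                   # largest order statistic
    lower = y_n / (1 - alpha / 2) ** (1 / n)
    upper = y_n / (alpha / 2) ** (1 / n)
    return lower, upper

lower, upper = uniform_upper_interval([8, 2, 5], 0.10)
print(round(lower, 2), round(upper, 2))     # roughly 8.14 and 21.72
```

Note that the lower limit can never fall below the largest observation, as it should, since $\theta$ cannot be smaller than any observed value.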


Note 12.3. If the uniform population is over $[a,b]$, $b>a$, then by using the largest
and smallest order statistics one can construct confidence intervals for $b$ when $a$ is
known, and for $a$ when $b$ is known. Simultaneous intervals for $a$ and $b$ will be discussed
later.


12.5 Confidence intervals in discrete distributions 


Here, we will consider a general procedure of setting up confidence intervals for the
Bernoulli parameter $p$ and the Poisson parameter $\lambda$. In discrete cases, such as a
binomial, cutting off tail probability equal to $\alpha/2$ each may not be possible because the
probability masses are at individually distinct points. When we add up the tail
probabilities we may not get exact values such as, for example, 0.025. When we add up a few
points, the sum of the probabilities may be less than 0.025 and when we add up the
next probability the total may exceed 0.025. Hence in discrete situations we take the
tail probabilities as $\le\alpha/2$ so that the middle probability will be $\ge 1-\alpha$. Take the nearest
point so that the tail probability is closest to $\alpha/2$ but less than or equal to $\alpha/2$.


12.5.1 Confidence interval for the Bernoulli parameter p 


We can set up confidence intervals for the Bernoulli parameter p by taking n observa- 
tions from a Bernoulli population or one observation from a binomial population. The 
binomial population has the probability function 


$$f(x,p)=\binom{n}{x}p^x(1-p)^{n-x},\quad 0<p<1,\ x=0,1,\dots,n$$

and zero elsewhere. We can assume $n$ to be known. We will see that we cannot find
a pivotal quantity $Q$ so that the probability function of $Q$ is free of $p$. For a binomial
random variable $x$, we can make a statement

$$\Pr\{x\le x_{1-\alpha/2}\}\le\frac{\alpha}{2}, \tag{12.6}$$

that is, the left tail probability is less than or equal to $\alpha/2$ for any given $\alpha$ if $p$ is known.
But since $x$ is not a pivotal quantity, $x_{1-\alpha/2}$ will be a function of $p$, that is, $x_{1-\alpha/2}(p)$. For
a given $p$, we can compute $x_{1-\alpha/2}$ for any given $\alpha$. For a given $p$, we can compute two
points $x_1(p)$ and $x_2(p)$ such that

$$\Pr\{x_1(p)\le x\le x_2(p)\}\ge 1-\alpha \tag{a}$$

or we can select $x_1(p)$ and $x_2(p)$, for a given $p$, such that

$$\Pr\{x\le x_1(p)\}\le\frac{\alpha}{2} \tag{b}$$

and

$$\Pr\{x\ge x_2(p)\}\le\frac{\alpha}{2}. \tag{c}$$

For every given $p$, the points $x_1(p)$ and $x_2(p)$ are available. If we plot $x=x_1(p)$ and
$x=x_2(p)$ against $p$, then we may get the graphs as shown in Figure 12.1. Let the observed
value of $x$ be $x_0$. If the line $x=x_0$ cuts the bands $x_1(p)$ and $x_2(p)$, then the inverse
images will be $p_1$ and $p_2$ as shown in Figure 12.1. The cut on $x_1(p)$ will give $p_2$ and that
on $x_2(p)$ will give $p_1$, so that a $100(1-\alpha)\%$ confidence interval for $p$ is $[p_1,p_2]$. Note that the
region below the line $x=x_0$ is characterized by the probability $\alpha/2$ and similarly the
region above the line $x=x_0$ is characterized by $\alpha/2$. Hence the practical procedure is
the following: Consider equation (b) with $x_1(p)=x_0$ and search through the binomial
table for a $p$; then the solution in (b) will give $p_2$. Take equation (c) with $x_2(p)=x_0$ and
search through the binomial tables for a $p$; then the solution in (c) gives $p_1$.


Figure 12.1: Lower and upper confidence bands. 


Note that in some situations the line $x=x_0$ may not cut one or both of the curves $x_1(p)$
and $x_2(p)$. We may have situations where $p_1$ and $p_2$ cannot be found, or $p_1$ may be 0 or
$p_2$ may be 1.

Let the observed value of $x$ be $x_0$; for example, suppose that we observed 3 successes
in $n=10$ trials. Then our $x_0=3$. We can take $x_1(p)=x_{1-\alpha/2}(p)=x_0$ and search for
that $p$, say $p_2$, which will satisfy the inequality

$$\sum_{x=0}^{x_0}\binom{n}{x}p_2^{\,x}(1-p_2)^{n-x}\le\frac{\alpha}{2}. \tag{12.7}$$

This will give one value of $p$, namely $p_2$, for which (12.6) holds. Now consider the upper
tail probability. Consider the inequality

$$\Pr\{x\ge x_{\alpha/2}(p)\}\le\frac{\alpha}{2}. \tag{12.8}$$

Again let us take $x_2(p)=x_{\alpha/2}(p)=x_0$ and search for $p$ for which (12.8) holds. Call it $p_1$. That
is,

$$\sum_{x=x_0}^{n}\binom{n}{x}p_1^{\,x}(1-p_1)^{n-x}\le\frac{\alpha}{2}. \tag{12.9}$$

Then

$$\Pr\{p_1\le p\le p_2\}\ge 1-\alpha \tag{12.10}$$

gives the required $100(1-\alpha)\%$ confidence interval $[p_1,p_2]$ for $p$.




Example 12.3. If 10 Bernoulli trials gave 3 successes, compute a 95% confidence in- 
terval for the probability of success p. Note that for the same p both (12.7) and (12.9) 
cannot hold simultaneously. 


Solution 12.3. Consider the inequality

$$\sum_{x=0}^{3}\binom{10}{x}p_2^{\,x}(1-p_2)^{10-x}\le 0.025.$$

Look through a binomial table for $n=10$ and all values of $p$. From tables, we see that
for $p=0.5$ the sum is 0.1719, which indicates that the value of $p_2$ is bigger than 0.5.
Most of the tables are given only for $p$ up to 0.5, the reason being that for $p>0.5$ we
can still use the same table, by putting $y=n-x$ and writing

$$\sum_{x=0}^{3}\binom{10}{x}p^x(1-p)^{10-x}=\sum_{x=0}^{3}\binom{10}{x}q^{10-x}(1-q)^{x}=\sum_{y=7}^{10}\binom{10}{y}q^{y}(1-q)^{10-y}\le 0.025$$

where $q=1-p$. Now looking through the binomial tables we see that $q=0.4$. Hence
$p_2=1-q=0.6$. Now we consider the inequality

$$\sum_{x=3}^{10}\binom{10}{x}p_1^{\,x}(1-p_1)^{10-x}\le 0.025,$$

which is the same as saying

$$\sum_{x=0}^{2}\binom{10}{x}p_1^{\,x}(1-p_1)^{10-x}\ge 0.975.$$

Now, looking through the binomial table for $n=10$ and all $p$ we see that $p_1=0.05$.
Hence the required 95% confidence interval for $p$ is $[p_1,p_2]=[0.05,0.60]$. We have 95%
confidence on this interval.
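
Instead of searching a printed table, the tail conditions (12.7) and (12.9) can be solved numerically with equality. The sketch below does this by bisection for $x_0=3$, $n=10$, $\alpha=0.05$; the function names are hypothetical. Because a table lists only a coarse grid of $p$ values, limits read from it, such as those in Solution 12.3, can differ noticeably from the numerical solution.

```python
from math import comb

def binom_cdf(k, n, p):
    """Pr{x <= k} for a binomial random variable with parameters n and p."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k + 1))

def bisect_root(f, lo=0.0, hi=1.0, tol=1e-10):
    """Root of a monotone function f on [lo, hi] with a sign change."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def exact_binomial_interval(x0, n, alpha):
    """Limits p1, p2 solving the tail conditions (12.9) and (12.7) with equality."""
    # Upper limit p2: Pr{x <= x0 | p2} = alpha/2 (no solution needed when x0 = n).
    p2 = 1.0 if x0 == n else bisect_root(lambda p: binom_cdf(x0, n, p) - alpha / 2)
    # Lower limit p1: Pr{x >= x0 | p1} = alpha/2 (no solution needed when x0 = 0).
    p1 = 0.0 if x0 == 0 else bisect_root(
        lambda p: 1 - binom_cdf(x0 - 1, n, p) - alpha / 2)
    return p1, p2

p1, p2 = exact_binomial_interval(3, 10, 0.05)
print(round(p1, 3), round(p2, 3))       # roughly 0.067 and 0.652
```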


Note 12.4. We can use the exact procedure of this section to construct a confidence
interval for the parameter $\theta$ of a one-parameter distribution whether we have a
pivotal quantity or not. Take any convenient statistic $T$ for which the distribution can
be derived. This distribution will contain $\theta$. Let $T_0$ be the observed value of $T$.
Consider the inequalities

$$\Pr\{T\le T_0\}\le\frac{\alpha}{2} \tag{a}$$

and

$$\Pr\{T\ge T_0\}\le\frac{\alpha}{2}. \tag{b}$$

If the inequalities have solutions (note that both cannot be satisfied by the same
$\theta$ value), then the solutions of (a) and (b) give the two limits; calling the smaller one
$\theta_1$ and the larger one $\theta_2$, $[\theta_1,\theta_2]$ is a $100(1-\alpha)\%$ confidence interval for $\theta$. As an
exercise, the student is advised to use this exact procedure to construct a confidence
interval for $\theta$ in an exponential population. Use the sample sum as $T$.


This exact procedure can be adopted for getting confidence intervals for the Poisson
parameter $\lambda$. In this case, make use of the property that the sample sum is again
a Poisson with the parameter $n\lambda$. This is left as an exercise to the student.


Exercises 12.2—12.5 


12.5.1. Construct a 95% confidence interval for the location parameter $\gamma$ in an
exponential population in Example 12.2 by using (1) $\bar{x}$, the sample mean of a sample of size
$n$; (2) the sample sum for a sample of size 2; (3) one observation from the population.


12.5.2. By using the observed sample 3, 8, 4, 5 from an exponential population,

$$f(x\mid\theta,\gamma)=\frac{1}{\theta}\,e^{-(x-\gamma)/\theta},\quad x\ge\gamma,\ \theta>0$$

and zero elsewhere, construct a 95% confidence interval for (1) $\theta$ if $\gamma=2$; (2) $\gamma$ if $\theta=4$.


12.5.3. Consider a uniform population over $[a,b]$, $b>a$. Assume that the observed
sample 2, 8, 3 is available from this population. Construct a 95% confidence interval
for (1) $a$ when $b=8$; (2) $b$ when $a=1$, by using order statistics.


12.5.4. Consider the same uniform population in Exercise 12.5.3 with $a=0$. Assume
that a sample of size 2 is available. (1) Compute the density of the sample sum
$y=x_1+x_2$; (2) by using $y$ construct a 95% confidence interval for $b$ if the observed sample
is 2, 6.


12.5.5. Construct a 90% confidence interval for the Bernoulli parameter $p$ if 2 successes
are obtained in (1) 10 trials; (2) eight trials.


12.5.6. Consider a Poisson population with parameter $\lambda$. Construct a 90% confidence
interval for $\lambda$ if 3, 7, 4 is an observed sample.


12.6 Confidence intervals for parameters in $N(\mu,\sigma^2)$


First, we will consider a simple problem of constructing a confidence interval for the
mean value $\mu$ in a normal population when the population variance is known. Then
we will consider intervals for $\mu$ when $\sigma^2$ is not known. Then we will look at intervals
for $\sigma^2$. In the following situations, we will be constructing the central intervals
for convenience. These central intervals will be the shortest when the pivotal
quantities have symmetric distributions. In the case of confidence intervals for the
population variance, the pivotal quantity taken is a chi-square variable, which does not
have a symmetric distribution, and hence the central interval cannot be expected to
be the shortest, but for convenience we will consider the central intervals in all
situations.


12.6.1 Confidence intervals for $\mu$


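Case 1 (Population variance $\sigma^2$ is known). Here, we can take a pivotal quantity as the
standardized sample mean

$$z=\frac{\sqrt{n}\,(\bar{x}-\mu)}{\sigma}\sim N(0,1)$$

which is free of all parameters when $\sigma$ is known. Hence we can read off $z_{\alpha/2}$ and $z_{1-\alpha/2}$ so
that

$$\Pr\{z_{1-\alpha/2}\le z\le z_{\alpha/2}\}=1-\alpha.$$

Since a standard normal density is symmetric at $z=0$, we have $z_{1-\alpha/2}=-z_{\alpha/2}$. Let us
examine the mathematical inequalities.

$$-z_{\alpha/2}\le z\le z_{\alpha/2}\ \Rightarrow\ -z_{\alpha/2}\le\frac{\sqrt{n}\,(\bar{x}-\mu)}{\sigma}\le z_{\alpha/2}$$

$$\Rightarrow\ -z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\le\bar{x}-\mu\le z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$

$$\Rightarrow\ \bar{x}-z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\le\mu\le\bar{x}+z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$

and hence

$$\Pr\{-z_{\alpha/2}\le z\le z_{\alpha/2}\}=\Pr\Big\{\bar{x}-z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\le\mu\le\bar{x}+z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\Big\}=1-\alpha.$$

Hence a $100(1-\alpha)\%$ confidence interval for $\mu$, when $\sigma^2$ is known, is given by

$$\Big[\bar{x}-z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x}+z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\Big]. \tag{12.11}$$

Figure 12.2 gives an illustration of the construction of the central confidence
interval for $\mu$ in a normal population with $\sigma^2$ known.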


Example 12.4. Construct a 95% confidence interval for $\mu$ in a $N(\mu,\sigma^2=4)$ from the
following observed sample: $-5, 0, 2, 15$.


Figure 12.2: Confidence interval for $\mu$ in a $N(\mu,\sigma^2)$, $\sigma^2$ known.


Solution 12.4. Here, the sample mean is $\bar{x}=(-5+0+2+15)/4=3$. $1-\alpha=0.95$ means
$\alpha/2=0.025$. From a standard normal table, we have $z_{0.025}=1.96$ approximately. $\sigma^2$ is
given to be 4, and hence $\sigma=2$. Therefore, from (12.11), one 95% confidence interval for
$\mu$ is given by

$$\Big[\bar{x}-z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x}+z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\Big]=\Big[3-1.96\Big(\frac{2}{\sqrt{4}}\Big),\ 3+1.96\Big(\frac{2}{\sqrt{4}}\Big)\Big]=[1.04,\ 4.96].$$

We have 95% confidence that the unknown $\mu$ is in this interval.


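Note that the length of the interval in this case is

$$\Big[\bar{x}+z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\Big]-\Big[\bar{x}-z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\Big]=2z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$

which is free of all variables, and hence it is equal to its expected value, or the expected
length of the interval in this case is $2z_{\alpha/2}\,\sigma/\sqrt{n}=2(1.96)(2/\sqrt{4})=3.92$ in Example 12.4.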
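
The computation in Example 12.4 can be reproduced with the Python standard library, where `statistics.NormalDist().inv_cdf` plays the role of the standard normal table. This is a sketch; the function name `z_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean

def z_interval(sample, sigma, alpha):
    """Central interval (12.11) for the mean when sigma is known."""
    n = len(sample)
    z = NormalDist().inv_cdf(1 - alpha / 2)     # z_{alpha/2}, about 1.96 for alpha = 0.05
    half_length = z * sigma / sqrt(n)
    x_bar = mean(sample)
    return x_bar - half_length, x_bar + half_length

lower, upper = z_interval([-5, 0, 2, 15], sigma=2, alpha=0.05)
print(round(lower, 2), round(upper, 2))     # 1.04 4.96
print(round(upper - lower, 2))              # length 3.92, free of the sample values
```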

Example 12.5. For a binomial random variable $x$, it is known that for large $n$ ($n\ge 20$,
$np\ge 5$, $n(1-p)\ge 5$) the standardized binomial variable is approximately a standard
normal. By using this approximation set up an approximate $100(1-\alpha)\%$ confidence
interval for $p$, the probability of success.


Solution 12.5. We will construct a central interval. We have

$$\frac{x-np}{\sqrt{np(1-p)}}\approx z,\quad z\sim N(0,1).$$

From a standard normal table, we can obtain $z_{\alpha/2}$ so that an approximate probability is
the following:

$$\Pr\Big\{-z_{\alpha/2}\le\frac{x-np}{\sqrt{np(1-p)}}\le z_{\alpha/2}\Big\}\approx 1-\alpha.$$

The inequality can be written as

$$\frac{(x-np)^2}{np(1-p)}\le z_{\alpha/2}^2.$$

Opening this up as a quadratic equation in $p$, when the equality holds, and then solving
for $p$, one has

$$p=\frac{\big(x+\tfrac{1}{2}z_{\alpha/2}^2\big)\mp\sqrt{\big(x+\tfrac{1}{2}z_{\alpha/2}^2\big)^2-x^2\big(1+\tfrac{1}{n}z_{\alpha/2}^2\big)}}{n\big(1+\tfrac{1}{n}z_{\alpha/2}^2\big)}. \tag{12.12}$$

These two roots are the lower and upper $100(1-\alpha)\%$ central confidence limits for $p$
approximately. For example, for $n=20$, $x=8$, $\alpha=0.05$ we have $z_{0.025}=1.96$. Substituting
these values in (12.12) we obtain the approximate roots as 0.22 and 0.61. Hence an
approximate 95% central confidence interval for the binomial parameter $p$ in this case
is $[0.22, 0.61]$. [Simplifications of the computations are left to the student.]
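
The two roots in (12.12) can be evaluated directly; an interval of this form is commonly known as the Wilson score interval. A minimal sketch for $n=20$, $x=8$, $\alpha=0.05$ follows, with a hypothetical function name.

```python
from math import sqrt
from statistics import NormalDist

def approx_binomial_interval(x, n, alpha):
    """Roots of the quadratic behind (12.12): approximate central limits for p."""
    z = NormalDist().inv_cdf(1 - alpha / 2)         # z_{alpha/2}
    centre = x + z * z / 2
    root = sqrt(centre**2 - x * x * (1 + z * z / n))
    denom = n * (1 + z * z / n)
    return (centre - root) / denom, (centre + root) / denom

lower, upper = approx_binomial_interval(8, 20, 0.05)
print(round(lower, 2), round(upper, 2))     # 0.22 0.61
```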


Case 2 (Confidence intervals for $\mu$ when $\sigma^2$ is unknown). In this case, we cannot take
the standardized normal variable as our pivotal quantity because, even though the
distribution of the standardized normal is free of all parameters, we have a $\sigma$ present
in the standardized variable, which acts as a nuisance parameter here.


Definition 12.4 (Nuisance parameters). These are parameters which are not rele- 
vant for the problem under consideration but which are going to be present in the 
computations. 


Hence our aim is to come up with a pivotal quantity involving the sample values
and $\mu$ only and whose distribution is free of all parameters. We have such a quantity
here, which is the Student-t variable. Consider the following pivotal quantity, which
has a Student-t distribution:

$$\frac{\sqrt{n}\,(\bar{x}-\mu)}{s_1}\sim t_{n-1},\qquad s_1^2=\frac{\sum_{j=1}^{n}(x_j-\bar{x})^2}{n-1} \tag{12.13}$$

where $s_1^2$ is an unbiased estimator for the population variance $\sigma^2$. Note that a Student-t
distribution is symmetric around $t=0$. Hence we can expect the central interval to be
the shortest interval in expected value. For constructing a central $100(1-\alpha)\%$ confidence
interval for $\mu$, read off the upper tail point $t_{n-1,\alpha/2}$ such that

$$\Pr\{t_{n-1}\ge t_{n-1,\alpha/2}\}=\frac{\alpha}{2}.$$

Then we can make the probability statement

$$\Pr\{-t_{n-1,\alpha/2}\le t_{n-1}\le t_{n-1,\alpha/2}\}=1-\alpha. \tag{12.14}$$

Substituting for $t_{n-1}$ and converting the inequalities into inequalities over $\mu$, we have
the following:

$$\Pr\Big\{\bar{x}-t_{n-1,\alpha/2}\frac{s_1}{\sqrt{n}}\le\mu\le\bar{x}+t_{n-1,\alpha/2}\frac{s_1}{\sqrt{n}}\Big\}=1-\alpha \tag{12.15}$$

which gives a central $100(1-\alpha)\%$ confidence interval for $\mu$. Figure 12.3 gives the
illustration of the percentage points.


Figure 12.3: Percentage points from a Student-t density.


The interval is of length $2t_{n-1,\alpha/2}\,s_1/\sqrt{n}$, which contains the variable $s_1$, and hence it is a
random quantity. We can compute the expected value of this length by using the fact that

$$\frac{(n-1)s_1^2}{\sigma^2}\sim\chi^2_{n-1}$$

where $\chi^2_{n-1}$ is a chi-square variable with $(n-1)$ degrees of freedom.


Example 12.6. Construct a 99% confidence interval for $\mu$ in a normal population with
unknown variance, by using the observed sample 1, 0, 5 from this normal population.


Solution 12.6. The sample mean is $\bar{x}=(1+0+5)/3=2$. An observed value of $s_1^2$ is given
by

$$s_1^2=\frac{1}{2}\big[(1-2)^2+(0-2)^2+(5-2)^2\big]=7\ \Rightarrow\ s_1=\sqrt{7}=2.6457513.$$

Now, our $\alpha=0.01\Rightarrow\alpha/2=0.005$. From a Student-t table for $n-1=2$ degrees of freedom,
$t_{2,0.005}=9.925$. Hence a 99% central confidence interval for $\mu$ here is given by

$$\Big[2-9.925\frac{\sqrt{7}}{\sqrt{3}},\ 2+9.925\frac{\sqrt{7}}{\sqrt{3}}\Big]=[-13.16,\ 17.16].$$
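
The arithmetic of Example 12.6 can be checked with a few lines of code. The sketch below takes the percentage point 9.925 from the Student-t table quoted in the solution rather than computing it, since the Python standard library has no Student-t quantile function; the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def t_interval(sample, t_crit):
    """Central interval (12.15); t_crit is t_{n-1, alpha/2} read from a table."""
    n = len(sample)
    x_bar = mean(sample)
    s1 = stdev(sample)                  # square root of the unbiased variance estimate
    half_length = t_crit * s1 / sqrt(n)
    return x_bar - half_length, x_bar + half_length

lower, upper = t_interval([1, 0, 5], t_crit=9.925)
print(round(lower, 2), round(upper, 2))     # -13.16 17.16
```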


Note 12.5. In some books, the students may find the statement that when the sample
size $n\ge 30$ one can get a good normal approximation for Student-t, and hence
take $z_{\alpha}$ from a standard normal table instead of $t_{\nu,\alpha}$ from the Student-t table with
$\nu$ degrees of freedom, for $\nu\ge 30$. The student may look into the exact percentage
points from the Student-t table to see that even for the degrees of freedom $\nu=120$
the upper tail areas of the standard normal and Student-t do not agree with each
other. Hence taking $z_{\alpha}$ instead of $t_{\nu,\alpha}$ for $\nu\ge 30$ is not a proper procedure.


12.6.2 Confidence intervals for $\sigma^2$ in $N(\mu,\sigma^2)$


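Here, we can consider two situations: (1) $\mu$ is known, (2) $\mu$ is not known, and we wish
to construct confidence intervals for $\sigma^2$ in $N(\mu,\sigma^2)$. Convenient pivotal quantities are
the following: When $\mu$ is known we can use

$$\sum_{j=1}^{n}\frac{(x_j-\mu)^2}{\sigma^2}\sim\chi^2_n\quad\text{and}\quad\sum_{j=1}^{n}\frac{(x_j-\bar{x})^2}{\sigma^2}\sim\chi^2_{n-1}.$$

Then from a chi-square density we have

$$\Pr\Big\{\chi^2_{n,1-\alpha/2}\le\sum_{j=1}^{n}\frac{(x_j-\mu)^2}{\sigma^2}\le\chi^2_{n,\alpha/2}\Big\}=1-\alpha \tag{12.16}$$

and

$$\Pr\Big\{\chi^2_{n-1,1-\alpha/2}\le\sum_{j=1}^{n}\frac{(x_j-\bar{x})^2}{\sigma^2}\le\chi^2_{n-1,\alpha/2}\Big\}=1-\alpha. \tag{12.17}$$


Figure 12.4: Percentage points from a chi-square density.


Note that (12.16) can be rewritten as

$$\Pr\Big\{\frac{\sum_{j=1}^{n}(x_j-\mu)^2}{\chi^2_{n,\alpha/2}}\le\sigma^2\le\frac{\sum_{j=1}^{n}(x_j-\mu)^2}{\chi^2_{n,1-\alpha/2}}\Big\}=1-\alpha.$$

A similar probability statement can be obtained by rewriting (12.17). Therefore, a
$100(1-\alpha)\%$ central confidence interval for $\sigma^2$ is the following:

$$\Big[\frac{\sum_{j=1}^{n}(x_j-\mu)^2}{\chi^2_{n,\alpha/2}},\ \frac{\sum_{j=1}^{n}(x_j-\mu)^2}{\chi^2_{n,1-\alpha/2}}\Big];\qquad\Big[\frac{\sum_{j=1}^{n}(x_j-\bar{x})^2}{\chi^2_{n-1,\alpha/2}},\ \frac{\sum_{j=1}^{n}(x_j-\bar{x})^2}{\chi^2_{n-1,1-\alpha/2}}\Big]. \tag{12.18}$$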


Note that a $\chi^2$ distribution is not symmetric and hence we cannot expect to get the
shortest interval by taking the central intervals. The central intervals are taken only
for convenience. When $\mu$ is unknown, then we cannot use
$\sum_{j=1}^{n}(x_j-\mu)^2/\sigma^2\sim\chi^2_n$ because the nuisance parameter $\mu$ is present. We can use the
pivotal quantity

$$\sum_{j=1}^{n}\frac{(x_j-\bar{x})^2}{\sigma^2}\sim\chi^2_{n-1}$$

and construct a $100(1-\alpha)\%$ central confidence interval, and it is the second one given
in (12.18). When $\mu$ is known, we can also use the standardized normal

$$\frac{\sqrt{n}\,(\bar{x}-\mu)}{\sigma}\sim N(0,1)$$

as a pivotal quantity to construct a confidence interval for $\sigma$, and thereby a confidence
interval for $\sigma^2$. Note that if $[T_1,T_2]$ is a $100(1-\alpha)\%$ confidence interval for $\theta$ then
$[g(T_1),g(T_2)]$ is a $100(1-\alpha)\%$ confidence interval for $g(\theta)$ when $\theta$ to $g(\theta)$ is a one to
one increasing function (for a decreasing function the two end points are interchanged).


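Example 12.7. If $-2, 1, 7$ is an observed sample from a $N(\mu,\sigma^2)$, construct a 95%
confidence interval for $\sigma^2$ when (1) $\mu=1$, (2) $\mu$ is unknown.


Solution 12.7. $\bar{x}=\frac{-2+1+7}{3}=2$, $\sum_{j=1}^{3}(x_j-\bar{x})^2=(-2-2)^2+(1-2)^2+(7-2)^2=42$,
$\sum_{j=1}^{3}(x_j-\mu)^2=(-2-1)^2+(1-1)^2+(7-1)^2=45$. $1-\alpha=0.95\Rightarrow\alpha/2=0.025$. From a chi-square table,
$\chi^2_{n,\alpha/2}=\chi^2_{3,0.025}=9.35$, $\chi^2_{n-1,\alpha/2}=\chi^2_{2,0.025}=7.38$, $\chi^2_{n,1-\alpha/2}=\chi^2_{3,0.975}=0.216$,
$\chi^2_{n-1,1-\alpha/2}=\chi^2_{2,0.975}=0.0506$. (2) Then when $\mu$ is unknown a 95% central confidence interval
for $\sigma^2$ is given by

$$\Big[\frac{\sum_{j=1}^{n}(x_j-\bar{x})^2}{\chi^2_{n-1,\alpha/2}},\ \frac{\sum_{j=1}^{n}(x_j-\bar{x})^2}{\chi^2_{n-1,1-\alpha/2}}\Big]=\Big[\frac{42}{7.38},\ \frac{42}{0.0506}\Big]=[5.69,\ 830.04].$$

(1) When $\mu=1$, we can use the above interval as well as the following interval:

$$\Big[\frac{\sum_{j=1}^{n}(x_j-\mu)^2}{\chi^2_{n,\alpha/2}},\ \frac{\sum_{j=1}^{n}(x_j-\mu)^2}{\chi^2_{n,1-\alpha/2}}\Big]=\Big[\frac{45}{9.35},\ \frac{45}{0.216}\Big]=[4.81,\ 208.33].$$

Note that when the information about $\mu$ is used the interval is shorter.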
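
The two intervals of Example 12.7 can be reproduced as follows. This sketch takes the chi-square percentage points 9.35, 0.216, 7.38 and 0.0506 from the table values quoted in the solution, since the Python standard library has no chi-square quantile function; the names are hypothetical.

```python
def variance_interval(sum_of_squares, chi2_upper, chi2_lower):
    """Central interval (12.18): divide by the upper point for the lower limit."""
    return sum_of_squares / chi2_upper, sum_of_squares / chi2_lower

sample = [-2, 1, 7]
x_bar = sum(sample) / len(sample)

ss_unknown = sum((x - x_bar) ** 2 for x in sample)   # 42, used when mu is unknown
ss_known = sum((x - 1) ** 2 for x in sample)         # 45, used when mu = 1 is known

ci_unknown = variance_interval(ss_unknown, chi2_upper=7.38, chi2_lower=0.0506)
ci_known = variance_interval(ss_known, chi2_upper=9.35, chi2_lower=0.216)

print([round(v, 2) for v in ci_unknown])    # [5.69, 830.04]
print([round(v, 2) for v in ci_known])      # [4.81, 208.33]
```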


Note 12.6. The student may be wondering whether it is possible to construct confidence
intervals for $\sigma$, once a confidence interval for $\sigma^2$ is established. For this, take the
corresponding square roots. If $[\phi_1(x_1,\dots,x_n),\ \phi_2(x_1,\dots,x_n)]$ is a $100(1-\alpha)\%$
confidence interval for $\theta$, then $[h(\phi_1),h(\phi_2)]$ is a $100(1-\alpha)\%$ confidence interval for $h(\theta)$
as long as $\theta$ to $h(\theta)$ is a one to one function.


Exercises 12.6 


12.6.1. Consider a $100(1-\alpha)\%$ confidence interval for $\mu$ in a $N(\mu,\sigma^2)$ where $\sigma^2$ is
known, by using the standardized sample mean. Construct the interval so that the left
tail area left out is $\alpha_1$ and the right tail area left out is $\alpha_2$ so that $\alpha_1+\alpha_2=\alpha$. Show that
the length of the interval is shortest when $\alpha_1=\alpha_2=\alpha/2$.


12.6.2. Let $x_1,\dots,x_n$ be iid as $N(\mu,\sigma^2)$ where $\sigma^2$ is known. Construct a $100(1-\alpha)\%$
central confidence interval for $\mu$ by using the statistic $c_1x_1+\dots+c_nx_n$ where $c_1,\dots,c_n$ are
known constants. Illustrate the result for $c_1=2$, $c_2=-3$, $c_3=5$ and based on the
observed sample $2, 1, -5$.


12.6.3. Construct (1) a 90%, (2) a 95%, (3) a 99% central confidence interval for $\mu$ in
Exercise 12.6.1 with $\sigma^2=2$ and based on the observed sample $-1, 2, 5, 7$.


12.6.4. Do the same Exercise 12.6.3 if $\sigma^2$ is unknown.


12.6.5. Compute the expected length in the central interval for the parameter $\mu$ in a
$N(\mu,\sigma^2)$, where $\sigma^2$ is unknown, and based on a Student-t statistic.


12.6.6. Compute the expected length as in Exercise 12.6.5 if the interval is obtained by
cutting off the areas $\alpha_1$ at the left tail and $\alpha_2$ at the right tail. Show that the expected
length is least when $\alpha_1=\alpha_2$.


12.6.7. Construct a 95% central confidence interval for $\mu$ in a $N(\mu,\sigma^2)$, when $\sigma^2$ is
unknown, by using the statistic $u=2x_1+x_2-5x_3$, and based on the observed sample
5, 2, 6.


12.6.8. By using the standard normal approximation for a standardized binomial variable
construct a 90% confidence interval (central) for $p$, the probability of success, if
(1) 7 successes are obtained in 20 trials; (2) 12 successes are obtained in 22 trials.


12.6.9. The grades obtained by students in a statistics course are assumed to be normally
distributed with mean value $\mu$ and variance $\sigma^2$. Construct a 95% confidence
interval for $\sigma^2$ when (1) $\mu=80$, (2) $\mu$ is unknown, based on the following observed
sample: 75, 85, 90, 90; (a) Consider central intervals, (b) Consider cutting off 0.05 at the right
tail.


12.6.10. Show that for the problem of constructing a confidence interval for $\sigma^2$ in a
$N(\mu,\sigma^2)$, based on a pivotal quantity having a chi-square distribution, the central
interval is not the shortest in expected length when the degrees of freedom is small.


12.7 Confidence intervals for linear functions of mean values 


Here, we are mainly interested in situations of the following types:

(1) A new drug is administered to lower blood pressure in human beings. A random sample of
$n$ individuals is taken. Let $x_j$ be the blood pressure before administering the drug and $y_j$ be the
blood pressure after administering the drug on the $j$-th individual, for $j=1,\dots,n$. Then
we have paired values $(x_j,y_j)$, $j=1,\dots,n$. Our aim may be to estimate the expected
difference, namely $\mu_2-\mu_1$, $\mu_2=E(y_j)$, $\mu_1=E(x_j)$, and test a hypothesis that $(x_j,y_j)$,
$j=1,\dots,n$ are identically distributed. But obviously, $y$ = the blood pressure after
administering the drug depends on $x$ = the blood pressure before administering the drug.
Here, $x$ and $y$ are dependent variables and may have a joint distribution.

(2) A sample of $n_1$ test plots are planted with corn variety 1 and a sample of $n_2$ test plots
are planted with corn variety 2. Let $x_1,\dots,x_{n_1}$ be the observations on the yield $x$ of corn
variety 1 and let $y_1,\dots,y_{n_2}$ be the observations on the yield $y$ of corn variety 2. Let the test
plots be homogeneous in all respects. Let $E(x)=\mu_1$ and $E(y)=\mu_2$. Someone may have a claim
that the expected yield of variety 2 is 3 times that of variety 1. Then our aim may be
to estimate $\mu_2-3\mu_1$. If someone has the claim that variety 2 is better than variety 1,
then our aim may be to estimate $\mu_2-\mu_1$. In this example, without loss of generality,
we may assume that $x$ and $y$ are independently distributed.

(3) A random sample of $n_1$ students of the same background are subjected to method 1 of
teaching (consisting of lectures followed by one final examination), and a random sample of
$n_2$ students of the same background, as of the first set of students, are subjected to method 2 of
teaching (may be consisting of each lecture followed by problem sessions and three
cumulative tests). Our aim may be to claim that method 2 is superior to method 1. Let
$\mu_2=E(y)$, $\mu_1=E(x)$ where $x$ and $y$ represent the grades under method 1 and method 2,
respectively. Then we may want to estimate $\mu_2-\mu_1$. Here also, it can be assumed that
$x$ and $y$ are independently distributed.

(4) Suppose that a farmer has planted 5 varieties of paddy (rice). Let the yield per test plot
of the 5 varieties be denoted by $x_1,\dots,x_5$ with $\mu_i=E(x_i)$, $i=1,\dots,5$. The market prices of
these varieties are respectively Rs 20, Rs 25, Rs 30, Rs 32, Rs 38 per kilogram. Then the
farmer's interest may be to estimate the money value, that is,
$20\mu_1+25\mu_2+30\mu_3+32\mu_4+38\mu_5$. Variety $i$ may be planted in
$n_i$ test plots so that the yields are $x_{ij}$, $j=1,\dots,n_i$, $i=1,\dots,5$, where $x_{ij}$ is the yield of the
$j$-th test plot under variety $i$.

Problems of the above types are of interest in this section. We will consider only
situations involving two variables. The procedure is exactly parallel when more
variables are involved. In the two variables case, we will first look at situations where
the variables are dependent in the sense of having a joint distribution. Situations
where the variables are assumed to be statistically independently distributed, in the
sense that the product probability property holds, will be considered later.


12.7.1 Confidence intervals for mean values when the variables are dependent 


When we have paired variables $(x,y)$, where $x$ and $y$ are dependent, then one way of
handling the situation is to consider $u=y-x$, in situations such as blood pressure
before administering the drug ($x$) and blood pressure after administering the drug ($y$),
if we wish to estimate $\mu_2-\mu_1=E(y)-E(x)$. If we wish to estimate a linear function
$a\mu_1+b\mu_2$ then consider the function $u=ax+by$. For example, $a=-1$ and $b=1$ gives
$\mu_2-\mu_1$. When $(x,y)$ has a bivariate normal distribution then it can be proved that
every linear function is univariate normal. That means, $u\sim N(\tilde{\mu},\tilde{\sigma}^2)$ where $\tilde{\mu}=a\mu_1+b\mu_2$
and $\tilde{\sigma}^2=a^2\sigma_1^2+b^2\sigma_2^2+2ab\,\mathrm{Cov}(x,y)$, $\sigma_1^2=\mathrm{Var}(x)$, $\sigma_2^2=\mathrm{Var}(y)$. Now, construct
confidence intervals for the mean value of $u$ in situations where (1) $\mathrm{Var}(u)$ is known, (2)
$\mathrm{Var}(u)$ is unknown, and confidence intervals for $\mathrm{Var}(u)$ for the cases when (1) $E(u)$ is
known, (2) $E(u)$ is unknown, by using the procedures in Section 12.6. Note that we
need not know about the individual parameters $\mu_1,\mu_2,\sigma_1^2,\sigma_2^2$ and $\mathrm{Cov}(x,y)$ in this
procedure.




Note 12.7. Many books may proceed with the assumption that $x$ and $y$ are independently
distributed, in situations like the blood pressure example, claiming that the effect
of the drug is washed out after two hours or that the dependency is gone after two hours.
Assuming statistical independence in such situations is not a proper procedure. When
paired values are available we can handle the situation by using $u$ as described above, which is
a correct procedure when the joint distribution is normal. If the joint distribution is
not normal, then we may evaluate the distribution of a linear function first and then
use that linear function to construct confidence intervals for linear functions of the mean
values.


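Example 12.8. The following are the paired observations on (x, y): (1, 4), (4, 8), (3, 6),
(2, 7), where x is the amount of a special animal feed and y is the gain in weight. It is
conjectured that y is approximately 2x + 1. Construct a 95% confidence interval for
(1) $E(u) = E[y - (2x + 1)] = \mu_2 - 2\mu_1 - 1$, $E(y) = \mu_2$, $E(x) = \mu_1$, (2) the variance of u, assuming
that (x, y) has a bivariate normal distribution.


Solution 12.8. Let u = y - 2x - 1. Then the observations on u are the following:

$$u_1 = 4 - 2(1) - 1 = 1,\quad u_2 = 8 - 2(4) - 1 = -1,\quad u_3 = 6 - 2(3) - 1 = -1,$$

$$u_4 = 7 - 2(2) - 1 = 2,\quad \bar{u} = \frac{1}{4}(1 - 1 - 1 + 2) = \frac{1}{4};$$

$$s_1^2 = \frac{1}{n-1}\sum_{j=1}^{n}(u_j - \bar{u})^2;$$

$$\text{Observed value} = \frac{1}{3}\Big[\Big(1 - \frac{1}{4}\Big)^2 + \Big(-1 - \frac{1}{4}\Big)^2 + \Big(-1 - \frac{1}{4}\Big)^2 + \Big(2 - \frac{1}{4}\Big)^2\Big] = \frac{108}{16 \times 3};$$

$$\frac{\sqrt{n}\,(\bar{u} - \mu)}{s_1} \sim t_{n-1} = t_3 \tag{12.19}$$

is Student-t with 3 degrees of freedom. [Since all linear functions of normal variables
(correlated or not) are normally distributed, u is $N(\mu, \sigma^2)$ where $\mu = E(u)$, $\sigma^2 = \mathrm{Var}(u)$.]
$t_{n-1,\alpha/2} = t_{3,0.025} = 3.182$ from Student-t tables (see the illustration in Figure 12.3). Hence
a 95% central confidence interval for $E(u) = \mu_2 - 2\mu_1 - 1$ is the following:

$$\Big[\bar{u} - t_{n-1,\alpha/2}\frac{s_1}{\sqrt{n}},\ \bar{u} + t_{n-1,\alpha/2}\frac{s_1}{\sqrt{n}}\Big] = \Big[\frac{1}{4} - 3.182\frac{\sqrt{108}}{4\sqrt{12}},\ \frac{1}{4} + 3.182\frac{\sqrt{108}}{4\sqrt{12}}\Big]$$

$$= [-2.14, 2.64].$$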


For constructing a 95% confidence interval for Var(u), one can take the pivotal quantity
as

$$\frac{\sum_{j=1}^{n}(u_j - \bar{u})^2}{\sigma^2} \sim \chi^2_{n-1} = \chi^2_3;\quad \chi^2_{3,0.025} = 9.35,\quad \chi^2_{3,0.975} = 0.216.$$

See the illustration of the percentage points in Figure 12.4. Then a 95% central
confidence interval is given by the following:

$$\Big[\frac{\sum_{j=1}^{n}(u_j - \bar{u})^2}{\chi^2_{n-1,\alpha/2}},\ \frac{\sum_{j=1}^{n}(u_j - \bar{u})^2}{\chi^2_{n-1,1-\alpha/2}}\Big] = \Big[\frac{108}{16(9.35)},\ \frac{108}{16(0.216)}\Big]$$

$$= [0.72, 31.25].$$
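The arithmetic of Solution 12.8 can be retraced with a short Python sketch. It uses $u = y - 2x - 1$, the form that appears in the numerical work of the solution, and takes the percentage points 3.182, 9.35 and 0.216 as quoted from the tables rather than computing them; the variable names are illustrative assumptions.

```python
import math

# Paired observations (x, y) of Example 12.8.
pairs = [(1, 4), (4, 8), (3, 6), (2, 7)]
u = [y - 2 * x - 1 for x, y in pairs]          # values 1, -1, -1, 2
n = len(u)
u_bar = sum(u) / n
ss = sum((ui - u_bar) ** 2 for ui in u)        # sum of squared deviations, 108/16
s1 = math.sqrt(ss / (n - 1))

# Percentage points quoted in the text (table values, not computed here).
t_3_025 = 3.182
chi2_3_025, chi2_3_975 = 9.35, 0.216

half = t_3_025 * s1 / math.sqrt(n)
mean_ci = (u_bar - half, u_bar + half)         # interval for E(u)
var_ci = (ss / chi2_3_025, ss / chi2_3_975)    # interval for Var(u)
print(mean_ci, var_ci)
```

Rounded to two decimals, the two intervals agree with the values stated in the solution.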


Note 12.8. Note that in the paired variable (x, y) case, if our interest is to construct
a confidence interval for $\mu_2 - \mu_1$, then take u = y - x and proceed as above. Whatever
be the linear function of $\mu_1$ and $\mu_2$ for which a confidence interval is needed, take
the corresponding linear function of x and y as u and then proceed. Do not assume
statistical independence of x and y unless there is theoretical justification to do so.


12.7.2 Confidence intervals for linear functions of mean values when there is 
statistical independence 


If x and y are statistically independently distributed with $E(x) = \mu_1$, $E(y) = \mu_2$, $\mathrm{Var}(x) = \sigma_1^2$,
$\mathrm{Var}(y) = \sigma_2^2$ and if simple random samples of sizes $n_1$ and $n_2$ are available from x
and y, then how can we set up confidence intervals for $a\mu_1 + b\mu_2 + c$ where a, b, c are
known constants? Let $x_1, \ldots, x_{n_1}$ and $y_1, \ldots, y_{n_2}$ be the samples from x and y, respectively.
If x and y are normally distributed, then the problem is easy; otherwise one has
to work out the distribution of the linear function first and then proceed. Let us assume
that $x \sim N(\mu_1, \sigma_1^2)$, $y \sim N(\mu_2, \sigma_2^2)$ and that they are independently distributed. Let

$$\bar{x} = \frac{\sum_{j=1}^{n_1} x_j}{n_1},\quad \bar{y} = \frac{\sum_{j=1}^{n_2} y_j}{n_2},\quad v_1^2 = \sum_{j=1}^{n_1}(x_j - \bar{x})^2,\quad v_2^2 = \sum_{j=1}^{n_2}(y_j - \bar{y})^2 \tag{12.20}$$

and $u = a\bar{x} + b\bar{y} + c$. Then $u \sim N(\mu, \sigma^2)$, where

$$\mu = E(u) = aE[\bar{x}] + bE[\bar{y}] + c = a\mu_1 + b\mu_2 + c,$$

$$\sigma^2 = \mathrm{Var}(u) = \mathrm{Var}(a\bar{x} + b\bar{y} + c) = \mathrm{Var}(a\bar{x} + b\bar{y}) = a^2\,\mathrm{Var}(\bar{x}) + b^2\,\mathrm{Var}(\bar{y})$$

since $\bar{x}$ and $\bar{y}$ are independently distributed, so that

$$\sigma^2 = a^2\frac{\sigma_1^2}{n_1} + b^2\frac{\sigma_2^2}{n_2}.$$

Our interest here is to set up confidence intervals for $a\mu_1 + b\mu_2 + c$. A usual situation
may be to set up confidence intervals for $\mu_2 - \mu_1$. In that case, c = 0, b = 1, a = -1.
Various situations are possible.


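Case 1 ($\sigma_1^2$ and $\sigma_2^2$ are known). In this case, we can take the pivotal quantity as the
standardized u. That is,

$$\frac{u - E(u)}{\sqrt{\mathrm{Var}(u)}} = \frac{u - (a\mu_1 + b\mu_2 + c)}{\sqrt{a^2\frac{\sigma_1^2}{n_1} + b^2\frac{\sigma_2^2}{n_2}}} \sim N(0, 1). \tag{12.21}$$

Hence a $100(1 - \alpha)\%$ central confidence interval for $a\mu_1 + b\mu_2 + c$ is the following:

$$\Big[u - z_{\alpha/2}\sqrt{a^2\frac{\sigma_1^2}{n_1} + b^2\frac{\sigma_2^2}{n_2}},\ u + z_{\alpha/2}\sqrt{a^2\frac{\sigma_1^2}{n_1} + b^2\frac{\sigma_2^2}{n_2}}\Big] \tag{12.22}$$

where $z_{\alpha/2}$ is illustrated in Figure 12.2.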


Case 2 ($\sigma_1^2 = \sigma_2^2 = \sigma^2$, unknown). In this case, the population variances are given to
be equal, but the common value is unknown. We can then use a Student-t statistic. Note from
(12.20) that $E[v_1^2] = (n_1 - 1)\sigma_1^2$ and $E[v_2^2] = (n_2 - 1)\sigma_2^2$, and hence when $\sigma_1^2 = \sigma_2^2 = \sigma^2$,
$E[v_1^2 + v_2^2] = (n_1 + n_2 - 2)\sigma^2$ or

$$E[v^2] = E\Big[\frac{v_1^2 + v_2^2}{n_1 + n_2 - 2}\Big] = \sigma^2. \tag{12.23}$$

Hence $\hat{\sigma}^2 = v^2$ can be taken as an unbiased estimator of $\sigma^2$. In the standardized normal
variable, if we replace $\sigma^2$ by $\hat{\sigma}^2$, then we should get a Student-t with $n_1 + n_2 - 2$ degrees
of freedom because the corresponding chi-square has $n_1 + n_2 - 2$ degrees of freedom.
Hence the pivotal quantity that we will use is the following:

$$\frac{(a\bar{x} + b\bar{y} + c) - (a\mu_1 + b\mu_2 + c)}{\hat{\sigma}\sqrt{\frac{a^2}{n_1} + \frac{b^2}{n_2}}} = \frac{(a\bar{x} + b\bar{y} + c) - (a\mu_1 + b\mu_2 + c)}{v\sqrt{\frac{a^2}{n_1} + \frac{b^2}{n_2}}} \sim t_{n_1+n_2-2} \tag{12.24}$$

where v is defined in (12.23). Now a $100(1 - \alpha)\%$ central confidence interval for
$a\mu_1 + b\mu_2 + c$ is given by

$$\Big[(a\bar{x} + b\bar{y} + c) \mp t_{n_1+n_2-2,\alpha/2}\; v\sqrt{\frac{a^2}{n_1} + \frac{b^2}{n_2}}\Big]. \tag{12.25}$$

The percentage point $t_{n_1+n_2-2,\alpha/2}$ is available from Figure 12.3 and v is available from
(12.23). If the confidence interval for $\mu_2 - \mu_1$ is needed, then put c = 0, b = 1, a = -1 in
(12.25).
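The steps of Case 2 can be collected into a small Python sketch. The function name `pooled_t_interval`, the sample values and the choice of 5 degrees of freedom are illustrative assumptions; the Student-t percentage point is passed in by the caller, as it would be read from tables.

```python
import math
from statistics import mean

def pooled_t_interval(xs, ys, a, b, c, t_crit):
    """Interval of the form (12.25) for a*mu1 + b*mu2 + c, equal unknown variances.

    t_crit is the upper alpha/2 percentage point of a Student-t with
    n1 + n2 - 2 degrees of freedom, supplied by the caller from tables.
    """
    n1, n2 = len(xs), len(ys)
    x_bar, y_bar = mean(xs), mean(ys)
    v1_sq = sum((x - x_bar) ** 2 for x in xs)        # v_1^2 of (12.20)
    v2_sq = sum((y - y_bar) ** 2 for y in ys)        # v_2^2 of (12.20)
    v = math.sqrt((v1_sq + v2_sq) / (n1 + n2 - 2))   # pooled estimate, (12.23)
    center = a * x_bar + b * y_bar + c
    half = t_crit * v * math.sqrt(a**2 / n1 + b**2 / n2)
    return center - half, center + half

# Hypothetical samples; 2.571 is the tabulated t value for 5 degrees of
# freedom and alpha = 0.05.
lo, hi = pooled_t_interval([1, 2, 3], [2, 4, 6, 8], a=-1, b=1, c=0, t_crit=2.571)
print(lo, hi)
```

With a = -1, b = 1, c = 0 the call returns an interval for $\mu_2 - \mu_1$, the special case mentioned at the end of Case 2.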


Case 3 ($\sigma_1^2$ and $\sigma_2^2$ are unknown but $n_1 \ge 30$, $n_2 \ge 30$). In this case, one may use the
following approximation to standard normal for setting up confidence intervals.

$$\frac{(a\bar{x} + b\bar{y} + c) - (a\mu_1 + b\mu_2 + c)}{\sqrt{a^2\frac{s_1^2}{n_1} + b^2\frac{s_2^2}{n_2}}} \sim N(0, 1) \tag{12.26}$$

approximately, where $s_1^2 = \frac{\sum_{j=1}^{n_1}(x_j - \bar{x})^2}{n_1}$, $s_2^2 = \frac{\sum_{j=1}^{n_2}(y_j - \bar{y})^2}{n_2}$ are the sample variances. When
$n_1$ and $n_2$ are large, dividing by $n_i$ or $n_i - 1$ for i = 1, 2 will not make a difference. Then
the approximate $100(1 - \alpha)\%$ central confidence interval for $a\mu_1 + b\mu_2 + c$ is given
by

$$(a\bar{x} + b\bar{y} + c) \mp z_{\alpha/2}\sqrt{a^2\frac{s_1^2}{n_1} + b^2\frac{s_2^2}{n_2}} \tag{12.27}$$

where the percentage point $z_{\alpha/2}$ is available from the standard normal density in
Figure 12.2.
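Cases 1 and 3 share the same form, differing only in whether the variances are known or estimated from large samples. A minimal Python sketch follows, with the standard normal percentage point taken from `statistics.NormalDist`; the function name `normal_interval` and the summary figures in the call are hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist

def normal_interval(x_bar, y_bar, var1, var2, n1, n2, a, b, c, alpha):
    """Interval of the form (12.22) or (12.27) for a*mu1 + b*mu2 + c.

    var1 and var2 are the known variances (Case 1) or the sample
    variances from large samples (Case 3).
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)          # upper alpha/2 point
    center = a * x_bar + b * y_bar + c
    half = z * math.sqrt(a**2 * var1 / n1 + b**2 * var2 / n2)
    return center - half, center + half

# Hypothetical summary data, chosen only for illustration.
lo, hi = normal_interval(x_bar=90, y_bar=80, var1=25, var2=10,
                         n1=40, n2=50, a=1, b=-1, c=0, alpha=0.05)
print(round(lo, 2), round(hi, 2))
```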


12.7.3 Confidence intervals for the ratio of variances 


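Here again, we consider two independently distributed normal variables $x \sim N(\mu_1, \sigma_1^2)$
and $y \sim N(\mu_2, \sigma_2^2)$ and simple random samples of sizes $n_1$ and $n_2$ from x and y, respectively.
We would like to construct a $100(1 - \alpha)\%$ confidence interval for $\theta = \frac{\sigma_1^2}{\sigma_2^2}$. We will
make use of the property that

$$\frac{\sum_{j=1}^{n_1}(x_j - \bar{x})^2}{\sigma_1^2} \sim \chi^2_{n_1-1},\quad \frac{\sum_{j=1}^{n_2}(y_j - \bar{y})^2}{\sigma_2^2} \sim \chi^2_{n_2-1},$$

$$u\Big(\frac{1}{\theta}\Big) = \frac{[\sum_{j=1}^{n_1}(x_j - \bar{x})^2/(n_1 - 1)]}{[\sum_{j=1}^{n_2}(y_j - \bar{y})^2/(n_2 - 1)]}\Big(\frac{1}{\theta}\Big) \sim F_{n_1-1,n_2-1}. \tag{12.28}$$

From this, one can make the following probability statement:

$$\Pr\Big\{F_{n_1-1,n_2-1,1-\frac{\alpha}{2}} \le u\Big(\frac{1}{\theta}\Big) \le F_{n_1-1,n_2-1,\frac{\alpha}{2}}\Big\} = 1 - \alpha.$$

Rewriting this as a statement on $\theta$, we have

$$\Pr\Big\{\frac{u}{F_{n_1-1,n_2-1,\frac{\alpha}{2}}} \le \theta \le \frac{u}{F_{n_1-1,n_2-1,1-\frac{\alpha}{2}}}\Big\} = 1 - \alpha \tag{12.29}$$

where the percentage points $F_{n_1-1,n_2-1,\frac{\alpha}{2}}$ and $F_{n_1-1,n_2-1,1-\frac{\alpha}{2}}$ are given in Figure 12.5, and

$$u = \frac{[\sum_{j=1}^{n_1}(x_j - \bar{x})^2/(n_1 - 1)]}{[\sum_{j=1}^{n_2}(y_j - \bar{y})^2/(n_2 - 1)]} \sim \theta F_{n_1-1,n_2-1}. \tag{12.30}$$


Figure 12.5: Percentage points from an F-density.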




Note 12.9. If confidence intervals for $a\frac{\sigma_1^2}{\sigma_2^2} = a\theta$, where a is a constant, are needed,
then multiply and divide u in (12.28) by a, absorb the denominator a with $\theta$ and
proceed to get the confidence intervals from (12.29). Also note that only the central
interval is considered in (12.29).


Note 12.10. Since the F-random variable has the property that $F_{m,n} = \frac{1}{F_{n,m}}$, we can convert
the lower percentage point $F_{m,n,1-\alpha/2}$ to an upper percentage point on $F_{n,m,\alpha/2}$.
That is,

$$F_{m,n,1-\frac{\alpha}{2}} = \frac{1}{F_{n,m,\frac{\alpha}{2}}}. \tag{12.31}$$

Hence usually the lower percentage points are not given in F-tables.


Example 12.9. Nine test plots of variety 1 and 5 test plots of variety 2 of tapioca gave
the following summary data: $s_1^2 = 10$ kg and $s_2^2 = 5$ kg, where $s_1^2$ and $s_2^2$ are the sample
variances. The yield x under variety 1 is assumed to be distributed as $N(\mu_1, \sigma_1^2)$ and the
yield y of variety 2 is assumed to be distributed as $N(\mu_2, \sigma_2^2)$ and independently of x.
Construct a 90% confidence interval for $3\frac{\sigma_1^2}{\sigma_2^2}$.


Solution 12.9. We want to construct a 90% confidence interval and hence, in our notation,
$\alpha = 0.10$, $\frac{\alpha}{2} = 0.05$. The parameter of interest is $3\theta = 3\frac{\sigma_1^2}{\sigma_2^2}$. Construct the interval for
$\theta$ and then multiply by 3. Hence the required statistic is

$$u = \frac{[\sum_{j=1}^{n_1}(x_j - \bar{x})^2/(n_1 - 1)]}{[\sum_{j=1}^{n_2}(y_j - \bar{y})^2/(n_2 - 1)]} = \frac{[9s_1^2/(8)]}{[5s_2^2/(4)]} \sim \theta F_{8,4}$$

and in observed value

$$u = \frac{[(9)(10)/8]}{[(5)(5)/4]} = \frac{9}{5}.$$

From F-tables, we have $F_{8,4,0.05} = 6.04$ and $F_{4,8,0.05} = 3.84$. Hence a 90% central
confidence interval for $3\theta$ is given by

$$\Big[\frac{27}{5(F_{8,4,0.05})},\ \frac{27}{5(F_{8,4,0.95})}\Big] = \Big[\frac{27}{5(F_{8,4,0.05})},\ \frac{27(F_{4,8,0.05})}{5}\Big] = \Big[\frac{27}{5(6.04)},\ \frac{27(3.84)}{5}\Big] = [0.89, 20.74].$$
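The computation in Solution 12.9 can be retraced with a few lines of Python. The F percentage points 6.04 and 3.84 are the table values quoted in the solution, and the lower point is obtained through the reciprocal relation of Note 12.10; the variable names are illustrative assumptions.

```python
# Summary data of Example 12.9; the sample variances use divisor n.
n1, s1_sq = 9, 10.0
n2, s2_sq = 5, 5.0

# Observed value of u in (12.30): ratio of the unbiased variance estimates.
u = (n1 * s1_sq / (n1 - 1)) / (n2 * s2_sq / (n2 - 1))

# Upper 5% points quoted in the text from F-tables.
F_8_4_05 = 6.04
F_4_8_05 = 3.84     # used through F_{8,4,0.95} = 1 / F_{4,8,0.05}

multiplier = 3      # the interval is wanted for 3 * theta
lower = multiplier * u / F_8_4_05
upper = multiplier * u * F_4_8_05
print(round(lower, 2), round(upper, 2))
```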


Note 12.11 (Confidence regions). In a population such as gamma (real scalar random
variable), there are usually two parameters, the scale parameter $\beta$, $\beta > 0$, and
the shape parameter $\alpha$, $\alpha > 0$. If relocation of the variable is involved, then there
is an additional location parameter $\gamma$, $-\infty < \gamma < \infty$. In a real scalar normal population
$N(\mu, \sigma^2)$, there are two parameters $\mu$, $-\infty < \mu < \infty$ and $\sigma^2$, $0 < \sigma^2 < \infty$. The
parameter space in the 3-parameter gamma density is

$$\Omega = \{(\alpha, \beta, \gamma) \mid 0 < \alpha < \infty,\ 0 < \beta < \infty,\ -\infty < \gamma < \infty\}.$$

In the normal case, the parameter space is $\Omega = \{(\mu, \sigma^2) \mid -\infty < \mu < \infty,\ 0 < \sigma^2 < \infty\}$.
Let $\theta = (\theta_1, \ldots, \theta_s)$ represent the set of all parameters in a real scalar population. In
the above gamma case, $\theta = (\alpha, \beta, \gamma)$, s = 3, and in the above normal case $\theta = (\mu, \sigma^2)$,
s = 2. We may be able to come up with a collection of one or more functions of the
sample values $x_1, \ldots, x_n$ and some of the parameters from $\theta$, say, $P = (P_1, \ldots, P_r)$, such
that the joint distribution of P is free of all parameters in $\theta$. Then we will be able to
make a statement of the type

$$\Pr\{P \in R_1\} = 1 - \alpha \tag{12.32}$$

for a given $\alpha$, where $R_1$ is a subspace of $R^r = R \times R \times \cdots \times R$ where R is the real line.
If we can convert this statement into a statement of the form

$$\Pr\{S_1 \text{ covers } \theta\} = 1 - \alpha \tag{12.33}$$

where $S_1$ is a subspace of the sample space S, then $S_1$ is the confidence region for
$\theta$. Since computations of confidence regions will be more involved, we will not be
discussing this topic further.

Exercises 12.7 


12.7.1. In a weight reduction experiment, a random sample of 5 individuals underwent
a certain dieting program. The weight of a randomly selected person, before
the program started, is x and when the program is finished it is y. (x, y) is assumed
to have a bivariate normal distribution. The following are the observations on (x, y):
(80, 80), (90, 85), (100, 80), (60, 55), (65, 70). Construct a 95% central confidence interval
for (a) $\mu_1 - \mu_2$, when (1) the variance of x - y is 4, (2) the variance of x - y is
unknown; (b) $0.2\mu_1 - \mu_2$ when (1) the variance of u = 0.2x - y is known to be 5, (2) the variance
of u is unknown.


12.7.2. Two methods of teaching are experimented on sets of $n_1 = 10$ and $n_2 = 15$ students.
These students are assumed to have the same backgrounds and are independently
selected. If x and y are the grades of randomly selected students under the two
methods, respectively, and if $x \sim N(\mu_1, \sigma_1^2)$ and $y \sim N(\mu_2, \sigma_2^2)$, construct 90% confidence
intervals for (a) $\mu_1 - 2\mu_2$ when (1) $\sigma_1^2 = 2$, $\sigma_2^2 = 5$, (2) $\sigma_1^2 = \sigma_2^2$ but unknown; (b) $2\sigma_1^2/\sigma_2^2$
when (1) $\mu_1 = -10$, $\mu_2 = 5$, (2) $\mu_1, \mu_2$ are unknown. The following summary statistics are
given, with the usual notations: $\bar{x} = 90$, $\bar{y} = 80$, $s_1^2 = 25$, $s_2^2 = 10$.




12.7.3. Consider the same problem as in Exercise 12.6.2 with $n_1 = 40$, $n_2 = 50$ but $\sigma_1^2$
and $\sigma_2^2$ are unknown. Construct a 95% confidence interval for $\mu_1 - \mu_2$ by using the
same summary data as in Exercise 12.7.2.


12.7.4. Prove that $F_{m,n,1-\alpha} = \frac{1}{F_{n,m,\alpha}}$.


12.7.5. Let $x_1, \ldots, x_n$ be iid variables from some population (discrete or continuous)
with mean value $\mu$ and variance $\sigma^2 < \infty$. Use the result that

$$\frac{\sqrt{n}\,(\bar{x} - \mu)}{\sigma} \sim N(0, 1)$$

approximately for large n, and set up a $100(1 - \alpha)\%$ confidence interval for $\mu$ when $\sigma^2$
is known.


12.7.6. The temperature reading x at location 1 and y at location 2 gave the following
data. A simple random sample of size $n_1 = 5$ on x gave $\bar{x} = 20$c and $s_1^2 = 5$c, and a random
sample of $n_2 = 8$ on y gave $\bar{y} = 30$c and $s_2^2 =$ ἃς. If $x \sim N(\mu_1, \sigma_1^2)$ and $y \sim N(\mu_2, \sigma_2^2)$
and independently distributed, then construct a 90% confidence interval for $\frac{\sigma_1^2}{\sigma_2^2}$.


13 Tests of statistical hypotheses 


13.1 Introduction 


People, organizations, companies, business firms, etc. make all sorts of claims. A busi- 
ness establishment producing a new exercise routine may claim that if someone goes 
through that routine the expected weight reduction will be 10 kilograms. If $\theta$ is the
expected reduction of weight, then the claim here is $\theta = 10$. A coaching centre may
claim that if a student goes through their coaching scheme, then the expected grade in
the national test will be more than 90%. If $\mu$ is the expected grade under their coaching
scheme, then the claim is that $\mu > 90$. A bird watcher may claim that birds on
the average lay more eggs in Tamilnadu compared to Kerala. If the expected numbers
of eggs per bird nest in Tamilnadu and Kerala are respectively $\mu_1$ and $\mu_2$, then
the claim is $\mu_1 > \mu_2$. A tourist resort operator in Kerala may claim that the true proportion
of tourists from outside Kerala visiting his resort is 0.9. If the probability of
finding a tourist from outside Kerala in this resort is p, then the claim is that p = 0.9.
An economist may claim that the incomes in community 1 are more spread out compared
to the incomes in community 2. If the spreads are denoted by the standard deviations
$\sigma_1$ and $\sigma_2$, then the claim is that $\sigma_1 > \sigma_2$. An educationist may claim that the
grades obtained by students in a particular course follow a bell curve. If the typical
grade is denoted by x, then the claim is that x is normally distributed. A traveler may
claim that the travel time needed to cover 5 kilometers in Ernakulam during peak
traffic time is longer than the time needed in Trivandrum. Here, the claim is that one
random variable is bigger than another random variable over a certain interval. In
all the above examples, we are talking about quantitative characteristics. If a soci- 
ologist claims that the tendency for the destruction of public properties by students 
and membership in a certain political party are associated, then the claim is about 
the association between two qualitative characteristics. If an engineer claims that in 
a certain production process the occurrence of defective items (items which do not 
have quality specifications) is a random phenomenon and not following any specific 
pattern then we want to test the randomness of this event. If the villagers claim that 
snake bite occurring in a certain village follows a certain pattern, then we may want 
to test for that pattern or the negation that there is no pattern or the event is a random 
event. If a biologist claims that the chance of finding a Rosewood tree in a Kerala forest
is much less than that in a Karnataka forest, then the claim may be of the form $p_1 < p_2$,
where $p_1$ and $p_2$ are the respective probabilities. If a physicist claims that every particle
attracts every other particle, then we classify this hypothesis as a conjecture if
“attraction” is properly defined. This conjecture can be disproved if two particles are 
found not having attraction. If a religious preacher claims that the only way to go to 
heaven is through his religion, then we will not classify this as a hypothesis because 
there are several undefined or not precisely defined items such as “heaven” and the 


Open Access. © 2018 Arak M. Mathai, Hans J. Haubold, published by De Gruyter. This work is licensed
under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
https://doi.org/10.1515/9783110562545-013




method of reaching there, etc. There is no way of collecting data and verifying the 
claim. 

We are looking at hypotheses where one can take data on some observable ran- 
dom variables and test the claims, or test the hypothesis by using some statistical pro- 
cedures or we are looking for some verifiable or testable types of claims. 

We have looked into various types of claims or hypotheses above. Out of these, 
the one about heaven has un-identifiable terms, and hence we do not count it as a 
hypothesis, and we cannot collect data either, to verify or test the claim. We will clas- 
sify hypotheses into two types: statistical and non-statistical. A statistical hypothesis 
has something to say about the behavior of one or more random variables. If the hy- 
pothesis is well-defined but no random phenomenon is involved, then we call such 
hypotheses as non-statistical hypotheses. Many of the physical laws or mathematical 
conjectures are non-statistical hypotheses. 


$$\text{A hypothesis}\ \begin{cases}\text{Statistical hypothesis}\\ \text{Non-statistical hypothesis}\end{cases}$$


In this chapter, we are concerned with statistical hypotheses only. Out of the many
statistical hypotheses described above, we have noted that some of them are dealing
with parameters of well-defined distributions and others are of the type of testing hy- 
potheses on qualitative characteristics, some are about randomness of phenomena, 
some are about certain patterns, etc. If a hypothesis is about the parameter(s) of well- 
defined distributions, then we call them parametric hypotheses. All other statistical 
hypotheses will be called non-parametric hypotheses. 


$$\text{A statistical hypothesis}\ \begin{cases}\text{Parametric hypothesis}\\ \text{Non-parametric hypothesis}\end{cases}$$


First, we will concentrate on parametric statistical hypotheses and later we will 
deal with non-parametric hypotheses. 


13.2 Testing a parametric statistical hypothesis 


We may test a parametric hypothesis of the type that the expected waiting time in a
service station for servicing a car is greater than or equal to 40 minutes. This hypothesis
is of the type $\theta \ge 40$, where $\theta$ is the expected waiting time; the waiting time may
have an exponential distribution with expected value $\theta$. Then the hypothesis is on a
parameter of a well-defined distribution. For testing a statistical hypothesis, we may
take some data from that population, then use some test criterion and make a decision
either to reject or not to reject that hypothesis. If the decision is to reject or not to reject,
then we have a two-decision problem. If our decision is of the form: reject, do not reject,
or take more observations because a decision cannot be reached with the available
observations, then it is a three-decision problem. Thus we may have a multiple decision
problem in any given parametric statistical hypothesis. First, we will consider a
two-decision situation. Here also, there are several possibilities. We have a hypothesis
that is being tested and the natural alternate against which the hypothesis is tested.
If we are testing the hypothesis that $\theta > 10$, then we are naturally testing it against the
alternate that $\theta \le 10$. If we test the hypothesis $\theta = 20$, then we are testing it against its
natural alternate $\theta \ne 20$.


Definition 13.1 (Null and alternate hypotheses). A hypothesis that is being tested
is called the null hypothesis and it is usually denoted by $H_0$. The hypothesis
against which the null hypothesis $H_0$ is tested is called the alternate hypothesis
and it is usually denoted by $H_1$ or $H_A$. We will use the notation $H_1$.


$$\text{A parametric hypothesis}\ \begin{cases}\text{A null parametric hypothesis } H_0\\ \text{An alternate parametric hypothesis } H_1\end{cases}$$


The term “null” came about for historical reasons. Originally, the claims that were
tested were of the type that there is significant difference between two quantitative
measurements such as the yield of corn without using fertilizers and with the use of
fertilizers. The hypothesis is usually formulated as there is no significant difference
(hypothesis of the type $H_0 : \mu_1 = \mu_2$) between expected yields and is tested against the
hypothesis that the difference is significant (hypothesis of the type $\mu_1 \ne \mu_2$). Nowadays
the term “null” is used to denote the hypothesis that is being tested whatever be the
nature of the hypothesis.

We may also have the possibility that once the hypothesis (null or alternate) is
imposed on the population the whole population may be fully known, in the sense of
no unknown parameters remaining in it, or the population may not be fully known. If
the population is exponential with parameter $\theta$ and if the hypothesis is $H_0 : \theta = 20$,
then when the hypothesis is imposed on the density there are no more parameters left
and the density is fully known. In this case, we call $H_0$ a “simple hypothesis”. The
alternate in this case is $H_1 : \theta \ne 20$. Then under this alternate, there are still a lot of
values possible for $\theta$, and hence the population is not determined. In such a case, we
call it a “composite” hypothesis.


Definition 13.2 (Simple and composite hypotheses). Once the hypothesis is im- 
posed on the population if the population is fully known, then the hypothesis is 
called a simple hypothesis and if some unknown quantities are still left or the pop- 
ulation is not fully known, then that hypothesis is called a composite hypothesis. 




$$\text{A parametric hypothesis}\ \begin{cases}\text{A simple hypothesis}\\ \text{A composite hypothesis}\end{cases}$$


Thus, a null parametric hypothesis can be simple or composite and similarly an
alternate parametric hypothesis can be simple or composite. For example, let us take
a normal population $N(\mu, \sigma^2)$. There are two parameters $\mu$ and $\sigma^2$ here. Let us consider
the following null and alternate hypotheses:

$H_0 : \mu = 0$, $\sigma^2 = 1$ (simple), alternate $H_1 : \mu \ne 0$, $\sigma^2 \ne 1$ (composite);

$H_0 : \mu = 0$ (composite), alternate $H_1 : \mu \ne 0$ (composite), $\sigma^2$ is unknown;

$H_0 : \mu \le 5$, $\sigma^2 = 1$ (composite), alternate $H_1 : \mu > 5$, $\sigma^2 = 1$ (composite);

$H_0 : \mu = 0$, $\sigma^2 \le 4$ (composite), alternate $H_1 : \mu = 0$, $\sigma^2 > 4$ (composite).

The simplest testing problem that can be handled is a simple null hypothesis versus
a simple alternate. But before starting the testing procedure we will examine more
details. When a decision is taken, after testing $H_0$, either to reject or not to reject $H_0$,
we have the following possibilities. The hypothesis itself may be correct or not correct.
Our decision may be correct or wrong. If a testing procedure gave a decision not to reject
the hypothesis, that does not mean that the hypothesis is correct. If the expected
waiting time in a queue, $\theta$, is hypothesized as $H_0 : \theta = 10$ minutes, this does not mean
that in fact the expected waiting time is 10 minutes. Our hypothesis may be different
from the reality of the situation. We have one of the following four possibilities in a
given testing situation:


                                        Hypothesis H0
                            H0 is true            H0 is not true
Decision   Reject H0        Type-I error          Correct decision
           Not reject H0    Correct decision      Type-II error


There are two situations of correct decision and two situations of wrong decision.
The error committed in rejecting the null hypothesis $H_0$ when it is in fact true
is called type-I error and the error of not rejecting when $H_0$ itself is not true is called
type-II error. The probabilities of committing these two errors are denoted by $\alpha$ and $\beta$.
That is,

$$\alpha = \Pr\{\text{reject } H_0 \mid H_0 \text{ is true}\}, \tag{13.1}$$

$$\beta = \Pr\{\text{not reject } H_0 \mid H_0 \text{ is not true}\}, \tag{13.2}$$


where a vertical bar indicates “given”. As an example, consider the following: Suppose
that we want to test the hypothesis $H_0 : \mu = 2$ in a normal population $N(\mu, 1)$. Here,
$\sigma^2$ is known to be one. Suppose that by using some procedure we have come up with
the following test criterion: Take one observation from this normal population. Reject
the null hypothesis if the observation is bigger than 3.75, otherwise do not reject $H_0$.
Here, $\alpha$ is the probability of rejecting $H_0$ when it is true or the probability of rejecting
$H_0 : \mu = 2$ when in fact the normal population is $N(\mu = 2, \sigma^2 = 1)$. Hence this probability
is given by the following integral:

$$\alpha = \int_{3.75}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(x-2)^2}\,dx = \int_{1.75}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}y^2}\,dy,\quad y = (x - \mu) = (x - 2),$$

$$= 0.04$$

from standard normal tables. Then $\beta$ is the probability of not rejecting when $\mu \ne 2$. We
do not reject when $x \le 3.75$ as per our criterion. Then

$$\beta = \int_{-\infty}^{3.75}\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(x-\mu)^2}\,dx = \int_{-\infty}^{3.75-\mu}\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}y^2}\,dy,\quad y = x - \mu,$$

which gives

$$1 - \beta = \int_{3.75-\mu}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}y^2}\,dy.$$

Note that $\beta$ as well as $1 - \beta$ can be read from standard normal tables for every given $\mu$.
In general, $\beta$ is a function of $\mu$ or $\beta = \beta(\mu)$. For $\mu = 2$, $1 - \beta = \alpha$ here.
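The two probabilities in this illustration can be evaluated without tables. The sketch below uses `statistics.NormalDist` from the Python standard library; the function name `type_two_error` and the particular values of $\mu$ in the loop are illustrative assumptions.

```python
from statistics import NormalDist

CUTOFF = 3.75          # reject H0 when the single observation exceeds this
MU_0 = 2.0             # value of mu under H0

def type_two_error(mu):
    """beta(mu) = Pr{x <= 3.75} when x ~ N(mu, 1)."""
    return NormalDist(mu, 1.0).cdf(CUTOFF)

alpha = 1.0 - NormalDist(MU_0, 1.0).cdf(CUTOFF)   # Pr{x > 3.75 | mu = 2}
print(round(alpha, 4))
for mu in (2.0, 3.0, 3.75, 5.0):                  # illustrative values of mu
    beta = type_two_error(mu)
    print(mu, round(beta, 4), round(1 - beta, 4))
```

The first printed value is the size of the test, and the rows of the loop trace $\beta(\mu)$ and $1 - \beta(\mu)$ as $\mu$ moves away from 2.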

We have seen that once a test criterion is given one can compute $\alpha$ = the probability
of type-I error and $\beta$ = the probability of type-II error. But how do we come up with a
test criterion? If we could make $\alpha = 0$ and $\beta = 0$ and come up with a criterion, then that
would be the best possible one. But we have a random situation, that is, our problem
is not of a deterministic type, and hence making $\alpha$ and $\beta$, or one of them, zero is not
possible. Then is it possible to minimize both $\alpha$ and $\beta$ simultaneously and come up
with a criterion? It can be seen that simultaneous minimization is not possible. This
may be easily seen from the above illustrative example of $N(\mu, 1)$ and testing the hypothesis
$H_0 : \mu = 2$. Take a simple situation of a simple $H_0$ versus a simple $H_1$. Suppose
that packets are filled with beans by an automatic filling machine. The machine is set
for filling 2 kilograms (kg) per packet or the expected weight $\mu = 2$ kg. If the machine
is filling without cutting and chopping, then the weight need not be exactly 2 kg but will
be around 2 kg per packet. Suppose that the machine is operating in Pala where there
is current fluctuation and stoppage almost every minute. At one of the sudden fluctuations,
the machine setting went off and the machine started filling 2.1 kg instead
of 2 kg. A shopkeeper bought some of these packets. She wants to know whether the
packets belong to the 2 kg set or the 2.1 kg set. These are the only possibilities. If the weight
distribution is $N(\mu, 1)$, then $\mu$ can take only two values, 2 or 2.1. If $H_0$ is $\mu = 2$, then naturally
$H_1$ here is $\mu = 2.1$. Take our illustrative example for this simple $H_0$ versus simple
$H_1$ case. Our criterion was to take one observation and, if it is bigger than 3.75, reject
$H_0$. We can reduce $\alpha$ by shifting our rejection point or critical point to the right, but
then we see that automatically $\beta$ is increased, and vice versa. Hence it is clear that simultaneous
minimization of $\alpha$ and $\beta$ is not possible. Then what is the next best thing
to do? We prefix either $\alpha$ or $\beta$ and minimize the other and then come up with a criterion.
This procedure is possible. Then the question is which one is to be pre-fixed? Which
one is usually a more serious error? Suppose that a new drug is being introduced into
the market for preventing a heart attack. The manufacturer's claim is that the drug can
prevent a heart attack if administered within one hour of the appearance of symptoms
of a heart attack. By a testing procedure, suppose that this claim is rejected and the
drug is rejected. Suppose that the claim was correct. The net result is a loss of money
for developing that drug. Suppose that the claim was not correct and the testing procedure
did not reject the drug. The net result is that a lot of lives are lost. Hence usually
a type-II error is more serious than a type-I error. Therefore, what is done is to prefix
$\alpha$ and minimize $\beta$ or prefix $\alpha$ and maximize $1 - \beta$.
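The trade-off between the two error probabilities in the packet-filling illustration can be tabulated directly. The following Python sketch assumes the setting described above (one observation, unit variance, $\mu = 2$ under $H_0$ and $\mu = 2.1$ under $H_1$); the function name and the list of cutoffs are illustrative assumptions.

```python
from statistics import NormalDist

H0 = NormalDist(2.0, 1.0)    # weight distribution under H0: mu = 2
H1 = NormalDist(2.1, 1.0)    # weight distribution under H1: mu = 2.1

def error_probabilities(cutoff):
    """Return (alpha, beta) for the rule 'reject H0 if x > cutoff'."""
    alpha = 1.0 - H0.cdf(cutoff)     # reject although H0 is true
    beta = H1.cdf(cutoff)            # do not reject although H1 is true
    return alpha, beta

# Move the critical point to the right and record both error probabilities.
results = [(c, *error_probabilities(c)) for c in (2.5, 3.0, 3.75, 4.5)]
for cutoff, alpha, beta in results:
    print(cutoff, round(alpha, 4), round(beta, 4))
```

As the cutoff moves to the right, the printed $\alpha$ values fall while the $\beta$ values rise, which is why only one of the two can be fixed in advance.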

Let us see what is happening when we have a test criterion. If we have one observation
$x_1$, then our sample space is the x-axis or the real line. If the test criterion says
to reject $H_0$ if $x_1 \ge 5$, then the sample space S is split into two regions: $C \subset S$, where
$C = \{x_1 \mid x_1 \ge 5\}$, and the complementary region $\bar{C} \subset S$ where $H_0$ is not rejected. If we
have two observations $(x_1, x_2)$, then we have the plane. If the test criterion says to reject
$H_0$ if the sample mean is greater than or equal to 1, then the rejection region in
the sample space S is $C = \{(x_1, x_2) \mid x_1 + x_2 \ge 2\} \subset S$. Figure 13.1 shows the illustration in
(a), (b), and in (c) the general Venn diagrammatic representation of the sample space
S and the rejection region C is given.


Figure 13.1: Critical regions.


Definition 13.3 (Critical region and size and power of the critical region). The region
C, $C \subset S$ where S is the sample space, where the null hypothesis $H_0$ is rejected,
is called the critical region. The probability of rejecting the null hypothesis when it
is true is $\alpha$, that is, the sample point falls in the critical region when $H_0$ is true, or

$$\alpha = \Pr\{x \in C \mid H_0\} \tag{13.3}$$

is called the size of the critical region or size of the test or size of the test criterion,
where $x = (x_1, \ldots, x_n)$ represents the sample point. Then $1 - \beta$ = the probability of
rejecting $H_0$ when the alternative $H_1$ is true is called the power of the critical region
C or power of the test.


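If $H_0$ is that $\theta \in \omega$, $\omega \subset \Omega$, that is, $\theta$ belongs to the subset $\omega$ of the parameter space,
then $H_1$ is that $\theta \in \bar{\omega}$, where $\bar{\omega}$ is the complement of $\omega$ in $\Omega$. In this case, both $\alpha$ and $\beta$
will be functions of $\theta$. For example, for the normal population $N(\mu, \sigma^2 = 1)$, if the null
hypothesis $H_0$ is $\mu \le 5$ then $H_1$ is $\mu > 5$. In both of the cases, there are plenty of $\mu$ values
present, and hence $\alpha = \alpha(\mu)$ and $\beta = \beta(\mu)$. We may write the above details in symbols
as follows:

$$\alpha = \alpha(\theta) = \Pr\{x \in C \mid H_0\},\quad x = (x_1, \ldots, x_n),\quad H_0 : \theta \in \omega \subset \Omega$$

$$= \text{size of the critical region } C \tag{13.4}$$

$$1 - \beta = 1 - \beta(\theta) = \Pr\{x \in C \mid H_1\},\quad H_1 : \theta \in \bar{\omega} \subset \Omega$$

$$= \text{power of the critical region } C. \tag{13.5}$$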


Definition 13.4 (Most powerful (MP) and uniformly most powerful (UMP) tests or
critical regions). If $H_0 : \theta = \theta_0$ and $H_1 : \theta = \theta_1$ are simple, that is, the populations are
fully known once the hypotheses are implemented, then there are only two points
$\theta_0$ and $\theta_1$ in the parameter space. If C is the critical region (or test) of size $\alpha$, which
has more power compared to any other critical region (or test) of the same size $\alpha$,
then C is called the most powerful critical region (or test). If $H_0$ or $H_1$ or both are
composite, then if C is the critical region of size $\alpha(\theta)$ and has more power compared
to any other critical region of the same size $\alpha(\theta)$, and for all values of $\theta \in \bar{\omega} \subset \Omega$, then
C is called the uniformly most powerful critical region (or test).


Definition 13.5 (Power curve). If the power $1 - \beta(\theta)$ is drawn against $\theta$, assuming
that $\theta$ is a scalar parameter, then the resulting curve is called the power curve.


In Figure 13.2, power curves are drawn for three critical regions C, D, E, all having the
same size $\alpha$. For $\theta > \theta_0$, both C and D have more power compared to E. For $\theta < \theta_0$, both
C and E have more power compared to D. For all $\theta \ne \theta_0$, C has more power compared
to D and E. Thus C is uniformly more powerful compared to D and E. If there is a C,
which is uniformly more powerful than any other critical region (or test) of the same
size $\alpha$, then C is uniformly the most powerful critical region (or test).


Figure 13.2: Power curves.
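Power curves of competing tests of the same size can be compared numerically. The sketch below is not a reproduction of Figure 13.2: it assumes a single observation from $N(\mu, 1)$, a hypothetical null value $\mu_0 = 0$ and size 0.05, and compares a right-tailed criterion with a two-sided one.

```python
from statistics import NormalDist

Z = NormalDist()              # standard normal distribution
ALPHA = 0.05                  # common size of both criteria
MU_0 = 0.0                    # hypothetical null value; one observation from N(mu, 1)

def power_right_tailed(mu):
    """Power of the criterion 'reject if x - MU_0 > z_alpha'."""
    z = Z.inv_cdf(1 - ALPHA)
    return 1 - Z.cdf(z - (mu - MU_0))

def power_two_sided(mu):
    """Power of the criterion 'reject if |x - MU_0| > z_{alpha/2}'."""
    z = Z.inv_cdf(1 - ALPHA / 2)
    shift = mu - MU_0
    return (1 - Z.cdf(z - shift)) + Z.cdf(-z - shift)

for mu in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(mu, round(power_right_tailed(mu), 4), round(power_two_sided(mu), 4))
```

In this sketch the right-tailed criterion has more power for $\mu > \mu_0$ and the two-sided criterion has more power for $\mu < \mu_0$, so neither is uniformly more powerful than the other over all $\mu \ne \mu_0$.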


Thus, for constructing a test criterion our procedure will be the following: If it is
the case of simple $H_0$ versus simple $H_1$, then look for the most powerful test (MPT). If
$H_0$ or $H_1$ or both are composite, then look for the uniformly most powerful test (UMPT).
First, we will consider the situation of simple $H_0$ versus simple $H_1$. An example of
this type was already discussed, that of automatic filling of bags with beans, where the
machine was set for an expected packet weight of 2 kg but, because a power surge
changed the machine setting unknowingly, the expected weight changed from 2.0 to
2.1 kg. There are only two parameter values here, 2.0 and 2.1. In the simple $H_0$ versus
simple $H_1$ case, there is a small result which gives a procedure for constructing a test
criterion for the most powerful test (MPT). This is called the Neyman-Pearson lemma.


Result 13.1 (Neyman-Pearson lemma). Let $x = (x_1, \ldots, x_n)$ be a sample point and let
$L = L(x_1, \ldots, x_n, \theta)$ = joint density or probability function of the sample point. Let $L_0$
be L under $H_0$ and $L_1$ be L under $H_1$. Let $H_0$ and $H_1$ be simple ($H_0$ will be of the form
$\theta = \theta_0$ (given), and $H_1$ is of the form $\theta = \theta_1$ (given)). Let the population support be
free of the parameter(s) $\theta$. Then the most powerful test (MPT) or the most powerful
critical region C is given by the rule

$$\frac{L_0}{L_1} \le k \text{ inside } C \text{ for some constant } k > 0,\qquad \frac{L_0}{L_1} > k \text{ outside } C;$$

then C is the most powerful critical region.


Proof. Let the size of the critical region $C$ be $\alpha$. Let $D$ be any other critical region of the same size $\alpha$. Then

$$\alpha = \Pr\{x \in C \mid H_0\} = \Pr\{x \in D \mid H_0\}.$$

We will give the proof in the continuous case; for the discrete case the steps are parallel. In the discrete case, the size of the critical region is to be taken as $\le \alpha$ because when adding up the probabilities at individually distinct points we may not hit the exact value $\alpha$. Then take the closest value that is $\le \alpha$. Then

$$\alpha = \int_C L_0\,dx = \int_{C\cap D} L_0\,dx + \int_{C\cap \bar D} L_0\,dx; \qquad \alpha = \int_D L_0\,dx = \int_{C\cap D} L_0\,dx + \int_{\bar C\cap D} L_0\,dx, \tag{a}$$

as shown in Figure 13.3, where $x = (x_1,\ldots,x_n)$, $dx = dx_1\wedge\cdots\wedge dx_n$, $\int_x$ stands for the multiple integral, and $\bar C$, $\bar D$ denote the complements of $C$ and $D$.




Figure 13.3: Illustration of Neyman-Pearson Lemma.

From equation (a), we have

$$\int_{C\cap \bar D} L_0\,dx = \int_{\bar C\cap D} L_0\,dx. \tag{b}$$


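Let $C$ be the critical region satisfying the condition $\lambda \le k$ inside $C$. Consider the power of $C$, denoted by $p_1$. Then

$$p_1 = 1 - \beta = \int_C L_1\,dx = \int_{C\cap D} L_1\,dx + \int_{C\cap \bar D} L_1\,dx.$$

But $C \cap \bar D \subset C$, and hence, inside $C$, $\frac{L_0}{L_1} \le k$ or $L_1 \ge \frac{L_0}{k}$. Therefore, we can write

$$\int_{C\cap \bar D} L_1\,dx \ge \int_{C\cap \bar D} \frac{L_0}{k}\,dx.$$

Then

$$p_1 \ge \int_{C\cap D} L_1\,dx + \int_{C\cap \bar D} \frac{L_0}{k}\,dx = \int_{C\cap D} L_1\,dx + \int_{\bar C\cap D} \frac{L_0}{k}\,dx \quad \text{from (b)}.$$

But $\bar C \cap D$ is outside $C$ and, therefore, $\frac{L_0}{L_1} > k$ there, that is, $\frac{L_0}{k} > L_1$. Therefore, substituting this, we have

$$p_1 \ge \int_{C\cap D} L_1\,dx + \int_{\bar C\cap D} L_1\,dx = \int_D L_1\,dx = \text{power of } D.$$

Therefore, $C$ is the most powerful critical region.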


Example 13.1. Consider a real scalar exponential population with parameter $\theta$ and a simple random sample $x_1,\ldots,x_n$ of size $n$. Let $H_0$ be $\theta = 5$ or $\theta = \theta_0$ (given) and $H_1$ be $\theta = 10$ or $\theta = \theta_1$ (given). Assume that the parameter space $\Omega$ consists of $\theta_0$ and $\theta_1$ only. Construct the MPT or the most powerful critical region.


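Solution 13.1. Consider the inequality

$$\frac{L_0}{L_1} \le k \;\Rightarrow\; \frac{\theta_1^n}{\theta_0^n}\, e^{-\left(\frac{1}{\theta_0}-\frac{1}{\theta_1}\right)(x_1+\cdots+x_n)} \le k.$$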




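Taking the natural logarithms on both sides, we have

$$\begin{aligned}
n(\ln\theta_1 - \ln\theta_0) - \left(\frac{1}{\theta_0}-\frac{1}{\theta_1}\right)(x_1+\cdots+x_n) \le \ln k \;&\Rightarrow\\
-\left(\frac{1}{\theta_0}-\frac{1}{\theta_1}\right)(x_1+\cdots+x_n) \le \ln k - n(\ln\theta_1-\ln\theta_0) = k_1 \;&\Rightarrow\\
\left(\frac{1}{\theta_0}-\frac{1}{\theta_1}\right)(x_1+\cdots+x_n) \ge -k_1.&
\end{aligned}$$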


Let $\theta_1 > \theta_0$. In this case $\left(\frac{1}{\theta_0} - \frac{1}{\theta_1}\right) > 0$, and dividing both sides by it leaves the inequality unchanged, so we have $x_1 + \cdots + x_n \ge k_2$ for some $k_2$. But we know the distribution of $u = x_1 + \cdots + x_n$, which is a gamma with the parameters (shape $n$, scale $\theta$). But we have $\theta = \theta_0$ or $\theta = \theta_1$. In both cases the gamma density is fully known, and hence we can compute percentage points. Let

$$\alpha = \int_C L_0\,dx = \Pr\{x_1+\cdots+x_n \ge u_\alpha \mid \theta = \theta_0\} = \int_{u_\alpha}^{\infty} \frac{u^{n-1}\,e^{-u/\theta_0}}{\Gamma(n)\,\theta_0^n}\,du.$$

For a prefixed $\alpha$, compute $u_\alpha$ from a gamma density (known) and then the test criterion says: Reject $H_0$ if the observed $u = x_1 + \cdots + x_n \ge u_\alpha$ as shown in Figure 13.4. This is the most powerful test. Note that if $\theta_1 < \theta_0$ then the MPT would have been the following: Reject $H_0$ if $u \le u_{1-\alpha}$. [This is left as an exercise to the student.]
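The percentage point $u_\alpha$ of Solution 13.1 can be computed numerically. The following is a minimal sketch, not part of the original text: it assumes $n = 4$ and $\alpha = 0.05$ (the example fixes neither), uses the closed-form tail of a gamma density with integer shape, and the helper names `gamma_upper_tail` and `critical_point` are hypothetical.

```python
import math

def gamma_upper_tail(u, n, theta):
    """Pr{x_1 + ... + x_n >= u} for n iid exponentials with mean theta.

    For integer shape n the gamma tail has the closed form
    exp(-u/theta) * sum_{k=0}^{n-1} (u/theta)^k / k!.
    """
    y = u / theta
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= y / k
        total += term
    return math.exp(-y) * total

def critical_point(alpha, n, theta0, tol=1e-10):
    """Solve Pr{u >= u_alpha | theta0} = alpha by bisection."""
    lo, hi = 0.0, n * theta0
    while gamma_upper_tail(hi, n, theta0) > alpha:
        hi *= 2.0                      # widen until the tail drops below alpha
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gamma_upper_tail(mid, n, theta0) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

theta0, theta1 = 5.0, 10.0             # values from Example 13.1
n, alpha = 4, 0.05                     # assumed here for illustration only
u_alpha = critical_point(alpha, n, theta0)
size = gamma_upper_tail(u_alpha, n, theta0)    # probability of rejecting under H0
power = gamma_upper_tail(u_alpha, n, theta1)   # probability of rejecting under H1
print(u_alpha, size, power)
```

Under these assumed values the rule is: reject $H_0$ when the observed sum of the four observations is at least `u_alpha`. The printed power shows how often this happens when $\theta = \theta_1$.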




Figure 13.4: Percentage points for gamma density. 


Exercises 13.2 


13.2.1. The ball-bearing for an Ambassador car is manufactured by the car manufacturer's outfit at Kolkata. Identical duplicates are manufactured by some outfit in Punjab. It is found that the true percentage of defective ones produced by the Kolkata firm is 10% and that of the Punjab firm is 15%. A spare parts dealer has a stock of the original and duplicate spare parts. A garage bought 10 ball-bearings and 3 were found to be defective; test the hypothesis at the 5% level of rejection that the garage's lot were duplicates.




13.2.2. On a particular stretch of a highway, the expected number of monthly traffic 
accidents is 5, and if traffic policemen control the traffic then the expected number is 3. 
Assume that the number of traffic accidents there is Poisson distributed. Randomly 
selected 4 months gave the data 0, 3, 2, 3 accidents. Test the hypothesis, at the 5% level 
of rejection, that the traffic policemen were present on that stretch of the highway. 


13.2.3. In the out-patient section of a small clinic, only one of two doctors Dr X and 
Dr Y will be present to attend to the out-patients on any given day. If Dr X is present, 
the expected waiting time in the queue is 30 minutes, and if Dr Y is present, then the 
expected waiting time is 40 minutes. Randomly selected 5 out-patients' waiting times 
on a particular day gave the data 50, 30, 40, 45, 25 minutes. Test the hypothesis at a
10% level of rejection, that Dr Y was present on that day, assuming an exponential
distribution for the waiting time. 


13.2.4. A margin-free shop has packets of a particular brand of potatoes marked as 5 kg packets. These packets are packed by automatic packing machines, for the expected weight of 5 kg, without cutting and chopping of potatoes. Sometimes the machine setting slips to 4.7 kg unknowingly. Four different housewives independently bought these bags of potatoes on a particular day and found them to have the exact weights 3.6, 4.8, 4.8, 5.0 kg. Test the hypothesis, at a 5% level of rejection, that the machine setting was 4.7 kg when those packets were packed. Assume that the weight is approximately normally distributed $N(\mu, \sigma^2)$ with $\sigma^2 = 0.04$ kg$^2$.


13.2.1 The likelihood ratio criterion or the $\lambda$-criterion


The Neyman-Pearson lemma leads to a more general procedure called the lambda criterion or the likelihood ratio test criterion. Consider a simple random sample $X = (x_1,\ldots,x_n)$ from some population $f(x,\theta)$. Then the joint density/probability function is $L(X,\theta) = \prod_{j=1}^n f(x_j,\theta)$. We can maximize this $L(X,\theta)$ over all $\theta \in \Omega$, the parameter space. If there are several maxima, then take the supremum. Let the null hypothesis $H_0$ be $\theta \in \omega \subset \Omega$. Under $H_0$, the parameter space is restricted to $\omega$. Then $L(X,\theta)|_{\theta\in\omega}$ will be denoted by $L_0$. Then the likelihood ratio criterion or $\lambda$-criterion is defined, over the support of $f(x,\theta)$, as follows:

$$\lambda = \frac{\sup_{\theta\in\omega} L}{\sup_{\theta\in\Omega} L}. \tag{13.6}$$


Suppose that $H_0$ was in fact true. In this case, $\omega = \Omega$ and in this case $\lambda = 1$. In general, $0 < \lambda \le 1$. Suppose that in a testing situation $\lambda$ is observed to be very small, close to zero. Then, from a layman's point of view, something is wrong with the null hypothesis because if the hypothesis is true, then $\lambda = 1$ and if it is nearly ok, then we could expect $\lambda$ to be close to 1. If $\lambda$ is observed at the other end, near zero, then we must reject our null hypothesis. Thus, we reject for small values of $\lambda$. Let $\lambda_\alpha$ be a point near zero. Then the criterion is the following:

Reject the null hypothesis if $\lambda \le \lambda_\alpha$ such that

$$\Pr\{\lambda \le \lambda_\alpha \mid H_0\} = \alpha. \tag{13.7}$$

This is known as the $\lambda$-criterion or the likelihood ratio criterion for testing the null hypothesis $H_0$.


Example 13.2. By using the likelihood ratio criterion, develop a test criterion for testing $H_0 : \mu \le \mu_0$ (given) for $\mu$ in a $N(\mu, \sigma^2)$ where $\sigma^2$ is known.


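Solution 13.2. Let $X = (x_1,\ldots,x_n)$ be a simple random sample from a $N(\mu,\sigma^2)$ where $\sigma^2$ is known. Then

$$L(X,\mu) = \prod_{j=1}^n \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2\sigma^2}(x_j-\mu)^2} = \frac{1}{(\sigma\sqrt{2\pi})^n} \exp\Big\{-\frac{1}{2\sigma^2}\Big[\sum_{j=1}^n (x_j-\bar x)^2 + n(\bar x-\mu)^2\Big]\Big\}. \tag{a}$$

The maximum likelihood estimator of $\mu$ over the whole parameter space $\Omega$ is $\bar x$. Here, $\sigma^2$ is known. Hence

$$\sup_{\mu\in\Omega} L(X,\mu) = L\big|_{\mu=\bar x} = \frac{1}{(\sigma\sqrt{2\pi})^n}\exp\Big\{-\frac{1}{2\sigma^2}\sum_{j=1}^n (x_j-\bar x)^2\Big\}.$$

The joint density of $X$ is given in (a). Now we want to maximize it under the null hypothesis $H_0 : \mu \le \mu_0$. From (a),

$$\max L \;\Rightarrow\; \max \ln L \;\Rightarrow\; \min_{\mu}\, n(\bar x-\mu)^2.$$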


If the observed $\bar x$ is less than $\mu_0$, then it is an admissible value for $\mu$ since $\mu \le \mu_0$, and hence the maximum likelihood estimate is $\bar x$ itself. Then $\lambda = 1$ and we never reject $H_0$. Hence the rejection can come only when the observed $\bar x > \mu_0$. In this case, $(\bar x - \mu)^2$ can be made a minimum by assigning the maximum possible value for $\mu$, which is $\mu_0$ because $\mu \le \mu_0$. Hence the maximum likelihood estimate for $\mu$ under $H_0$ is $\mu_0$ and we reject only for large values of $\bar x$. Substituting these and simplifying, we have

$$\lambda = \exp\Big\{-\frac{n}{2\sigma^2}(\bar x-\mu_0)^2\Big\},$$

which is a one to one function of $n(\bar x - \mu_0)^2$ or a one to one function of $\frac{\sqrt n(\bar x-\mu_0)}{\sigma} \sim N(0,1)$, remembering that $\bar x > \mu_0$ and we reject for large values of $\bar x$ or for large values of $z = \frac{\sqrt n(\bar x-\mu_0)}{\sigma}$. The probability coverage over this rejection region must be $\alpha$ for a prefixed $\alpha$. Then from $N(0,1)$ tables, we have

$$\Pr\{z \ge z_\alpha\} = \alpha. \tag{b}$$

Hence the test statistic here is $\frac{\sqrt n(\bar x-\mu_0)}{\sigma}$ and the test criterion is:

Reject $H_0$ if $\frac{\sqrt n(\bar x-\mu_0)}{\sigma} \ge z_\alpha$ where $z_\alpha$ is given in (b).
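The equivalence between "small $\lambda$" and "large $z$" in Solution 13.2 can be checked numerically. The sketch below is only an illustration: the values of $\mu_0$, $\sigma$, $n$, $\alpha$ and the sample means are hypothetical, and `likelihood_ratio` is a helper name introduced here. It uses $\lambda = e^{-z^2/2}$ for $\bar x > \mu_0$, so that $\lambda_\alpha = e^{-z_\alpha^2/2}$.

```python
import math
from statistics import NormalDist

def likelihood_ratio(xbar, mu0, sigma, n):
    """lambda for H0: mu <= mu0 in N(mu, sigma^2) with sigma known."""
    if xbar <= mu0:
        return 1.0   # xbar is admissible under H0, so both maxima coincide
    return math.exp(-n * (xbar - mu0) ** 2 / (2 * sigma ** 2))

mu0, sigma, n, alpha = 0.0, 1.0, 25, 0.05      # hypothetical values
z_alpha = NormalDist().inv_cdf(1 - alpha)      # upper percentage point of N(0,1)
lambda_alpha = math.exp(-z_alpha ** 2 / 2)     # matching cut-off for lambda

decisions = []
for xbar in (-0.5, 0.1, 0.3, 0.4, 1.0):        # hypothetical observed means
    z = math.sqrt(n) * (xbar - mu0) / sigma
    lam = likelihood_ratio(xbar, mu0, sigma, n)
    # First entry: decision from lambda; second entry: decision from z.
    decisions.append((lam <= lambda_alpha, z >= z_alpha))
    print(xbar, round(lam, 4), round(z, 2), decisions[-1])
```

Each printed pair agrees, which is the sense in which the $\lambda$-criterion and the $z$-criterion describe the same rejection region.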
0 σ α α 


Note 13.1. Was it possible to construct a test criterion if the null hypothesis was $H_0 : \mu < \mu_0$ in the open interval? From the procedure above, it may be noted that the maximum likelihood estimator (MLE) exists only when the boundary point $\mu_0$ is included. If $\mu_0$ is not included, then we could not have constructed $\lambda$. In that case, we could have tested a hypothesis of the type $\mu \ge \mu_0$ against $\mu < \mu_0$.


Note 13.2. If we had a hypothesis of the form $H_0 : \mu = \mu_0$ (given), $H_1 : \mu > \mu_0$ and if we had proceeded to evaluate the likelihood ratio criterion $\lambda$, then we would have ended up with the same criterion as for the hypotheses $H_0 : \mu \le \mu_0$, $H_1 : \mu > \mu_0$. But hypotheses of the type $H_0 : \mu = \mu_0$, $H_1 : \mu > \mu_0$ can be created only if we know beforehand that $\mu$ can only take values $\mu_0$ or higher. You may find misinterpretations in some books in the name of "one-sided tests". Such procedures of one-sided statements are logically incorrect if $\mu$ can also logically take values less than $\mu_0$. Similar comments hold for the case $H_0 : \mu = \mu_0$, $H_1 : \mu < \mu_0$.


Note 13.3. Hypotheses are to be formulated before the data are collected. Hypothe- 
ses have to come from theoretical considerations or claims made by manufacturers, 
business firms, etc. or proclamations made by public figures, etc. After formulating 
the hypotheses, data are to be collected to properly represent the populations as- 
sumed under the hypotheses. If hypotheses are formulated by looking at the data 
in hand, then it will result in the misuses of statistical procedures. Suppose that on 
four random occasions the information about a habitual gambler's net gain or loss 
is collected. All the four occasions resulted in huge gains. If you formulate a hy- 
pothesis that this gambler will always win, then such a claim may not be rejected 
by using any testing procedure based on the data in hand. Naturally, your conclu- 
sions will be logically absurd. Suppose that a sociologist has collected the data on 
annual incomes of families in Kerala. She checked five families at random. Her observations
were 1000, 2000, 5000, 5000, 2000 rupees. If she creates a hypothesis
that the expected family income in Kerala can never be greater than 5000 rupees 
or another hypothesis that the expected range of annual incomes will be between 
1000 and 5000 rupees, both will result in absurd conclusions if the testing is done 
by using the data in hand. Hence hypotheses should not be formulated by looking 
at the data in hand. 




Example 13.3. The grades obtained by the students in a particular course are assumed to be normally distributed, $N(\mu, \sigma^2)$, with $\sigma^2 = 16$. A randomly selected set of four students gave the grades as 60, 70, 80, 60. Test the hypothesis $H_0 : \mu \le 65$ against $H_1 : \mu > 65$, at the 2.5% level of rejection.


Solution 13.3. The observed sample mean $\bar x = \frac14(60 + 70 + 80 + 60) = 67.5$. Here, $\mu_0 = 65$, $n = 4$. From a standard normal table, the 2.5% percentage point $z_{0.025} = 1.96$. Hence

$$\frac{\sqrt n(\bar x-\mu_0)}{\sigma} \ge z_\alpha \;\Rightarrow\; \bar x \ge \mu_0 + z_\alpha\frac{\sigma}{\sqrt n} = 65 + 1.96\Big(\frac{4}{2}\Big) = 68.92.$$

But the observed $\bar x = 67.5$, and hence the hypothesis is not rejected.


Remark 13.1 ("Acceptance of a hypothesis"). If the testing procedure did not reject the null hypothesis $H_0$, the hypothesis being tested, can we "accept" the null hypothesis? This is a point of misuse of statistical techniques of testing of hypotheses. If you examine the procedures of constructing a test criterion, you see that it is done by minimizing the probability of type-II error and by prefixing the probability of type-I error, and the whole procedure deals with rejecting $H_0$ and not with anything else. If the hypothesis $H_0$ is not rejected, then the procedure does not say anything about the decision to be made. When we do or do not reject our own hypothesis by using our own testing procedure we are not making a mathematical statement that such and such a thing is true.


In our example, the construction of the test statistic had at least the logical founda- 
tion of the likelihood ratio principle. Very often statisticians take a pivotal quantity on 
their own, not coming from any principle or procedure, and claim that under their own 
hypothesis they know the distribution of their pivotal quantity, and their own testing 
procedure did not reject their hypothesis, and hence they are accepting the hypothe- 
sis. The logical fallacy of such a procedure and argument is very clear. Suppose that 
a farmer’s land bordering a forest area has troubles from wild animals. His vegetable 
gardens are usually destroyed. His claim is that wild buffaloes are the culprits. All buf- 
faloes have four legs. Suppose that he selected the property *animals having four legs" 
as the property based on which his hypothesis will be tested. Note that this property 
of having four legs is not a characteristic property of buffaloes. Most of the statistical 
tests are not based on characterizing properties. (When a test statistic is taken, there 
is no unique determination of the whole distribution.) The farmer checked on four 
different nights and counted the number of legs of the culprits. If on a majority of the
nights he found two-legged wild fowls or two-legged drunkards (human beings) destroying
his vegetable garden, then he can safely reject the hypothesis that wild buffaloes
are destroying his vegetable garden. But suppose that on all nights he found 4-legged
animals destroying his garden. He cannot accept the hypothesis that wild buffaloes 
are destroying his garden because 4-legged animals could be wild pigs (boars), por- 
cupines, elephants, etc., including buffaloes. 




13.3 Testing hypotheses on the parameters of a normal
population $N(\mu, \sigma^2)$


13.3.1 Testing hypotheses on $\mu$ in $N(\mu, \sigma^2)$ when $\sigma^2$ is known


We have already looked into a problem of testing hypotheses on the mean value $\mu$ in a real scalar normal population $N(\mu, \sigma^2)$ when $\sigma^2$ is known, as an illustrative example. By using the likelihood ratio principle and the $\lambda$-criterion we ended up with a criterion: reject $H_0$ if $\frac{\sqrt n(\bar x-\mu_0)}{\sigma} \ge z_\alpha$ for $H_0 : \mu \le \mu_0$ (given), $H_1 : \mu > \mu_0$. This is a test at level $\alpha$, or the size of the critical region is $\alpha$, or a test at level of rejection $\alpha$. If we had the hypotheses $H_0 : \mu \ge \mu_0$, $H_1 : \mu < \mu_0$, where $\sigma^2$ is known, then we would have ended up with the criterion: reject $H_0$ if $\frac{\sqrt n(\bar x-\mu_0)}{\sigma} \le -z_\alpha$. Similarly, for the hypotheses $H_0 : \mu = \mu_0$, $H_1 : \mu \ne \mu_0$ the criterion would have been to reject $H_0$ if $\left|\frac{\sqrt n(\bar x-\mu_0)}{\sigma}\right| \ge z_{\alpha/2}$. These two cases are left to the student as exercises. Also, it was pointed out that if the hypotheses were $H_0 : \mu = \mu_0$, $H_1 : \mu > \mu_0$ the criterion would have been the same as in the case of $H_0 : \mu \le \mu_0$, $H_1 : \mu > \mu_0$. Similarly, for the case $H_0 : \mu = \mu_0$, $H_1 : \mu < \mu_0$ the criterion would have been the same as for the case $H_0 : \mu \ge \mu_0$, $H_1 : \mu < \mu_0$. But such one-sided statements can be made only if one knows beforehand that $\mu$ cannot take the values less than $\mu_0$ or greater than $\mu_0$, as the case may be. The results of the tests of hypotheses on $\mu$ when $\sigma^2$ is known can be summarized as follows:


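Case (1). Population $N(\mu, \sigma^2)$, $\sigma^2$ known. $H_0 : \mu \le \mu_0$, $H_1 : \mu > \mu_0$.
Test statistic: $z = \frac{\sqrt n(\bar x-\mu_0)}{\sigma} \sim N(0,1)$.
Criterion: reject $H_0$ if the observed $z \ge z_\alpha$ or $\frac{\sqrt n(\bar x-\mu_0)}{\sigma} \ge z_\alpha$ or $\bar x \ge \mu_0 + z_\alpha\frac{\sigma}{\sqrt n}$.

Case (2). Population $N(\mu, \sigma^2)$, $\sigma^2$ known. $H_0 : \mu \ge \mu_0$, $H_1 : \mu < \mu_0$.
Test statistic: $z = \frac{\sqrt n(\bar x-\mu_0)}{\sigma} \sim N(0,1)$.
Criterion: reject $H_0$ if $\frac{\sqrt n(\bar x-\mu_0)}{\sigma} \le -z_\alpha$ or $\bar x \le \mu_0 - z_\alpha\frac{\sigma}{\sqrt n}$.

Case (3). Population $N(\mu, \sigma^2)$, $\sigma^2$ known. $H_0 : \mu = \mu_0$, $H_1 : \mu \ne \mu_0$.
Test statistic: $z = \frac{\sqrt n(\bar x-\mu_0)}{\sigma} \sim N(0,1)$.
Criterion: reject $H_0$ if $\left|\frac{\sqrt n(\bar x-\mu_0)}{\sigma}\right| \ge z_{\alpha/2}$ or $\bar x \ge \mu_0 + z_{\alpha/2}\frac{\sigma}{\sqrt n}$ or $\bar x \le \mu_0 - z_{\alpha/2}\frac{\sigma}{\sqrt n}$.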
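Cases (1) to (3) can be collected into one small routine. The following sketch is an illustration only; the function name `z_test` and its `case` argument are introduced here and are not from the text. It uses `statistics.NormalDist` from the Python standard library for the percentage points and applies Case (1) to the data of Example 13.3.

```python
from math import sqrt
from statistics import NormalDist

def z_test(xbar, mu0, sigma, n, alpha, case):
    """Return (z, reject) for Cases (1)-(3), sigma^2 known."""
    z = sqrt(n) * (xbar - mu0) / sigma
    std = NormalDist()                  # standard normal N(0,1)
    if case == 1:                       # H0: mu <= mu0, H1: mu > mu0
        reject = z >= std.inv_cdf(1 - alpha)
    elif case == 2:                     # H0: mu >= mu0, H1: mu < mu0
        reject = z <= -std.inv_cdf(1 - alpha)
    elif case == 3:                     # H0: mu = mu0, H1: mu != mu0
        reject = abs(z) >= std.inv_cdf(1 - alpha / 2)
    else:
        raise ValueError("case must be 1, 2 or 3")
    return z, reject

# Example 13.3: grades 60, 70, 80, 60 with sigma^2 = 16, mu0 = 65, alpha = 0.025
z, reject = z_test(xbar=67.5, mu0=65, sigma=4, n=4, alpha=0.025, case=1)
print(z, reject)
```

The result agrees with Solution 13.3: the observed $z$ does not reach $z_{0.025}$, so $H_0$ is not rejected.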


Figure 13.5: Illustration of the criteria for $H_0 : \mu \le \mu_0$; $H_0 : \mu \ge \mu_0$; $H_0 : \mu = \mu_0$.


The rejection region or critical region is $z_\alpha \le z < \infty$ for Case (1), or this region can be described as $\bar x \ge \mu_0 + z_\alpha\frac{\sigma}{\sqrt n}$, and $\alpha$ is the size of the critical region or level of the test. The critical point is $z_\alpha$ in terms of $z$ or $\mu_0 + z_\alpha\frac{\sigma}{\sqrt n}$ in terms of $\bar x$.


Figure 13.6: Critical point and critical region.


Remark 13.2. From Figure 13.6, note that the probability coverage over $\bar x \ge \mu_0 + z_\alpha\frac{\sigma}{\sqrt n}$ is $\alpha$ for every $n$. Now, assume that $n$ is becoming larger and larger. Then the point $Q$ starts moving towards $\mu_0$, or the range for $\bar x$ to fall in the rejection region becomes larger and larger, with the same prefixed $\alpha$. Finally, when $n \to \infty$, by the weak law of large numbers, $\bar x$ goes to the true value $\mu$ or to $\mu_0$ if the hypothesized $\mu_0$ is the true value. For $n$ becoming larger and larger, we keep on rejecting $H_0$. By our procedure, $\bar x$ must fall above $\mu_0$. Hence critics of testing procedures say that we can always reject the null hypothesis by taking a large enough sample size.
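The movement of the critical point towards $\mu_0$ described in Remark 13.2 can be seen numerically. The sketch below is an illustration only: it borrows $\mu_0 = 65$, $\sigma = 4$ and $\alpha = 0.025$ from Example 13.3, and the sample sizes are chosen arbitrarily.

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, alpha = 65.0, 4.0, 0.025           # borrowed from Example 13.3
z_alpha = NormalDist().inv_cdf(1 - alpha)

sizes = [4, 16, 64, 256, 1024]                 # arbitrary illustrative sample sizes
# Critical point in terms of the sample mean: mu0 + z_alpha * sigma / sqrt(n)
critical_points = [mu0 + z_alpha * sigma / sqrt(n) for n in sizes]

for n, q in zip(sizes, critical_points):
    print(n, round(q, 3))
```

The critical point shrinks towards $\mu_0$ at the rate $1/\sqrt n$, while the probability $\alpha$ above it under $\mu = \mu_0$ stays fixed.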


Example 13.4. The temperature at Pala during the month of August seems to hover around 28 °C. Someone wishes to test the hypothesis that the expected temperature on any given day in August in Pala area is less than 28 °C, assuming that the temperature distribution is $N(\mu, \sigma^2)$, with $\sigma^2 = 4$. The following is the data on temperature reading on randomly selected 4 days in August: 30, 31, 28, 25. Test at 2.5% level of rejection.


Solution 13.4. The hypothesis of the type less than 28 °C cannot be tested because it is in the open interval. We can test $H_0 : \mu \ge 28$ against $H_1 : \mu < 28$. Hence we formulate the hypothesis to be tested in this format. The observed sample mean $\bar x = \frac14(30 + 31 + 28 + 25) = 28.5$. $z_\alpha = z_{0.025} = 1.96$. $\mu_0 - z_\alpha\frac{\sigma}{\sqrt n} = 28 - 1.96\left(\frac{2}{2}\right) = 26.04$. But the observed value of $\bar x = 28.5 > 26.04$, and hence the hypothesis $H_0 : \mu \ge 28$ is not rejected at the 2.5% level.


13.3.2 Tests of hypotheses on $\mu$ in $N(\mu, \sigma^2)$ when $\sigma^2$ is unknown


Here, since $\sigma^2$ is unknown we need to estimate $\sigma^2$ also. These estimates in the whole parameter space are $\hat\mu = \bar x$ and $\hat\sigma^2 = s^2 = \frac{1}{n}\sum_{j=1}^n (x_j-\bar x)^2$. Hence, substituting these, we have the maximum of the likelihood function, given by

$$\max_{\mu,\sigma^2} L = \frac{e^{-n/2}}{(s^2\, 2\pi)^{n/2}}. \tag{13.8}$$

Let us try to test the null hypothesis $H_0 : \mu \ge \mu_0$ (given), against $H_1 : \mu < \mu_0$. If the observed $\bar x$ falls in the interval $\mu_0 \le \bar x < \infty$, then it is in the admissible range of $\mu$ under $H_0$, and hence the MLE of $\mu$, under the null hypothesis $H_0$, is $\bar x$ and then $\lambda = 1$, and hence we do not reject $H_0$ in this case. Hence the question of rejection comes only when the observed $\bar x$ falls below $\mu_0$. But

$$\frac{\partial}{\partial\mu}\ln L = 0 \;\Rightarrow\; \mu - \bar x = 0.$$
on 


Hence if $\mu$ cannot take the value $\bar x$, then we assign the closest possible value to $\bar x$ for $\mu$, which is $\mu_0$ under $H_0 : \mu \le \mu_0$ as well as for $H_0 : \mu \ge \mu_0$. [Note that this value can be assigned only because $\mu_0$ is an admissible value under $H_0$, or $H_0$ contains that point. Hence in the open interval $\mu > \mu_0$ testing is not possible by using this procedure.] Then the MLE of $\sigma^2$, under $H_0$, is given by

$$\hat\sigma^2 = \frac{1}{n}\sum_{j=1}^n (x_j-\mu_0)^2.$$


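Substituting these MLE's for $\mu$ and $\sigma^2$, under $H_0$, we have

$$\lambda^{2/n} = \frac{\sum_{j=1}^n (x_j-\bar x)^2}{\sum_{j=1}^n (x_j-\mu_0)^2} = \frac{\sum_{j=1}^n (x_j-\bar x)^2}{\sum_{j=1}^n (x_j-\bar x)^2 + n(\bar x-\mu_0)^2} = \frac{1}{1 + \frac{n(\bar x-\mu_0)^2}{\sum_{j=1}^n (x_j-\bar x)^2}}.$$

This is a one to one function of the Student-t statistic

$$t_{n-1} = \frac{\sqrt n(\bar x-\mu_0)}{s_1}, \qquad s_1^2 = \frac{\sum_{j=1}^n (x_j-\bar x)^2}{n-1}. \tag{13.9}$$

Since the criterion is constructed for $\bar x < \mu_0$, the hypothesis $H_0$ is rejected for small values of $t_{n-1}$, or

$$\Pr\{t_{n-1} \le -t_{n-1,\alpha}\} = \alpha. \tag{13.10}$$

Thus the criterion can be stated as follows: reject $H_0 : \mu \ge \mu_0$ when $t_{n-1} \le -t_{n-1,\alpha}$ or when the observed $\bar x \le \mu_0 - t_{n-1,\alpha}\frac{s_1}{\sqrt n}$, where $s_1^2$, the unbiased estimator of $\sigma^2$, is given in (13.9). Thus when $\sigma^2$ is unknown the standardized normal test statistic changes into a Student-t statistic.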


Note 13.4. Note from the examples and discussions so far that the rejection region is always in the direction of the alternate hypothesis. When the alternate is $\theta > \theta_0$, we reject at the right tail with probability $\alpha$; when $H_1$ is $\theta < \theta_0$ we reject at the left tail with probability $\alpha$; and when $H_1$ is $\theta \ne \theta_0$ we reject at both the tails (right and left with probabilities $\alpha_1$ and $\alpha_2$, such that $\alpha_1 + \alpha_2 = \alpha$, but for convenience we take $\alpha_1 = \alpha_2 = \frac{\alpha}{2}$ each).


Example 13.5. The following is the data on the time taken by a typist to type a page of mathematics in TeX: 20, 30, 25, 41. Assuming that the time taken is normally distributed $N(\mu, \sigma^2)$ with unknown $\sigma^2$, test the hypothesis $H_0 : \mu \ge \mu_0 = 30$, $H_1 : \mu < \mu_0$, at the level of rejection 5%.




Solution 13.5. The sample mean $\bar x = \frac14(20 + 30 + 25 + 41) = 29$. Observed value of $s_1^2 = \frac{\sum_{j=1}^n (x_j-\bar x)^2}{n-1} = \frac{242}{3}$, so that $s_1 \approx 8.98$; $t_{n-1,\alpha} = t_{3,0.05} = 2.353$; $\mu_0 = 30$. $\mu_0 - t_{n-1,\alpha}\frac{s_1}{\sqrt n} = 30 - 2.353\left(\frac{8.98}{2}\right) = 30 - 10.57 = 19.43$. But the observed $\bar x = 29$, which is not less than 19.43, and hence we cannot reject $H_0$, at a 5% level of rejection.

In the above example, the observed value of the Student-t variable is $t_{n-1} = \frac{\sqrt n(\bar x-\mu_0)}{s_1} = \frac{-1}{4.49} = -0.22$. The observed value of the test statistic is $-0.22$ and the rejection region is on the left tail. Hence we can compute the probability that $t_{n-1} \le -0.22$. This is called the p-value for this example.


Definition 13.6 (The p-values). Compute the observed value of the test statistic used for testing a hypothesis $H_0$. Let the test statistic be denoted by $u$ and the observed value be denoted by $u_0$. If the rejection region is on the right, then the p-value is $\Pr\{u \ge u_0\}$; if the rejection region is on the left (in this case usually $u_0$ will be negative if the statistic can take negative values) then the p-value is $\Pr\{u \le u_0\}$; and if the rejection region is at both ends then the p-value is $\Pr\{u \ge |u_0|\} + \Pr\{u \le -|u_0|\}$ if $u$ has a symmetric distribution, symmetric about $u = 0$.
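The calculations of Example 13.5 and its p-value can be reproduced with the standard library alone. This is a sketch under stated assumptions: the percentage point 2.353 is taken from the solution rather than computed, and the left-tail probability uses the closed-form distribution function of a Student-t with 3 degrees of freedom, which applies only because $n - 1 = 3$ here. The helper name `t3_cdf` is introduced for this illustration.

```python
from math import atan, pi, sqrt
from statistics import mean, stdev

def t3_cdf(t):
    """Distribution function of Student-t with exactly 3 degrees of freedom."""
    y = t / sqrt(3)
    return 0.5 + (y / (1 + y * y) + atan(y)) / pi

data = [20, 30, 25, 41]            # typing times from Example 13.5
mu0 = 30
t_table = 2.353                    # t_{3,0.05} as quoted in Solution 13.5

n = len(data)
xbar = mean(data)
s1 = stdev(data)                   # square root of the unbiased estimator (divisor n - 1)
t_obs = sqrt(n) * (xbar - mu0) / s1
critical_xbar = mu0 - t_table * s1 / sqrt(n)

reject = t_obs <= -t_table         # Case H0: mu >= mu0, H1: mu < mu0
p_value = t3_cdf(t_obs)            # left-tail probability Pr{t_3 <= t_obs}
print(xbar, round(critical_xbar, 2), round(t_obs, 2), reject, round(p_value, 3))
```

Because the p-value is well above 0.05, the same data would fail to reject $H_0$ at any of the usual levels, which is the kind of statement a p-value allows without re-reading tables for each $\alpha$.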


The advantage of p-values is that, instead of a pre-fixed $\alpha$, by looking at the p-values we can make decisions at various levels of $\alpha$, and conclude at which level $H_0$ is rejected and at which level $H_0$ is not rejected. We can summarize the inference on $\mu$, when $\sigma^2$ is unknown, as follows:


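Case (4). $H_0 : \mu \le \mu_0$, $H_1 : \mu > \mu_0$, population $N(\mu, \sigma^2)$, $\sigma^2$ unknown.
Test statistic: $\frac{\sqrt n(\bar x-\mu_0)}{s_1} \sim t_{n-1}$, $s_1^2 = \frac{\sum_{j=1}^n (x_j-\bar x)^2}{n-1}$.
Criterion: reject $H_0$ if the observed $t_{n-1} \ge t_{n-1,\alpha}$ or the observed value of $\bar x \ge \mu_0 + t_{n-1,\alpha}\frac{s_1}{\sqrt n}$.

Case (5). $H_0 : \mu \ge \mu_0$, $H_1 : \mu < \mu_0$, population $N(\mu, \sigma^2)$, $\sigma^2$ unknown.
Test statistic: same as in Case (4).
Criterion: reject $H_0$ if the observed value of $t_{n-1} \le -t_{n-1,\alpha}$ or the observed value of $\bar x \le \mu_0 - t_{n-1,\alpha}\frac{s_1}{\sqrt n}$.

Case (6). $H_0 : \mu = \mu_0$, $H_1 : \mu \ne \mu_0$, population $N(\mu, \sigma^2)$, $\sigma^2$ unknown.
Test statistic: same as in Cases (4) and (5).
Criterion: reject $H_0$ if the observed value of $\left|\frac{\sqrt n(\bar x-\mu_0)}{s_1}\right| \ge t_{n-1,\alpha/2}$ or the observed value of $\bar x \ge \mu_0 + t_{n-1,\alpha/2}\frac{s_1}{\sqrt n}$ or $\bar x \le \mu_0 - t_{n-1,\alpha/2}\frac{s_1}{\sqrt n}$, as illustrated in Figure 13.7.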


Figure 13.7: Illustration of $H_0$ on $\mu$ in $N(\mu, \sigma^2)$ when $\sigma^2$ is unknown.




13.3.3 Testing hypotheses on $\sigma^2$ in a $N(\mu, \sigma^2)$


Here, there are two cases to be considered: (a) when $\mu$ is known and (b) when $\mu$ is not known. The maximum of the likelihood function in the whole of the parameter space will be the following:

$$\max_{(\mu,\sigma^2)\in\Omega} L = \frac{e^{-n/2}}{(2\pi s^2)^{n/2}}, \qquad s^2 = \frac{1}{n}\sum_{j=1}^n (x_j-\mu)^2.$$

Replace $\mu$ by $\bar x$ if $\mu$ is unknown. Let us take the case $H_0 : \sigma^2 \le \sigma_0^2$, $H_1 : \sigma^2 > \sigma_0^2$, where $\mu$ is known. Then what is the maximum of the likelihood function under this $H_0$? Let $\theta = \sigma^2$. The likelihood equation is the following:

$$\frac{\partial}{\partial\theta}\ln L = 0 \;\Rightarrow\; \theta - s^2 = 0, \qquad s^2 = \frac{\sum_{j=1}^n (x_j-\mu)^2}{n}.$$

Then if $s^2$ is an admissible value for $\theta$ then $\hat\theta = s^2$, otherwise assign the closest possible value to $s^2$ for $\theta$. The closest possible value to $s^2$ for $\theta$ is $\sigma_0^2$ for $H_0 : \sigma^2 \le \sigma_0^2$ (given), as well as for $H_0 : \sigma^2 \ge \sigma_0^2$. But

$$u = \frac{\sum_{j=1}^n (x_j-\mu)^2}{\sigma_0^2} \sim \chi^2_n,$$

a chi-square with $n$ degrees of freedom. If $\mu$ is unknown, then $\mu$ will be replaced by $\bar x$ when doing the maximization and in this case $u \sim \chi^2_{n-1}$. This is the only difference. Hence the $\lambda$-criterion becomes


$$\lambda = \frac{e^{n/2}}{n^{n/2}}\,u^{n/2}\,e^{-u/2}. \tag{13.11}$$

The shape of this $\lambda$, as a function of $u$, is given in Figure 13.8.

Figure 13.8: Shape of $\lambda$-criterion as a function of $u = \chi^2_n$.
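The shape of $\lambda$ as a function of $u$ can be checked numerically from the form $\lambda = (e/n)^{n/2}\,u^{n/2}\,e^{-u/2}$. The sketch below is an illustration only; $n = 4$ and the grid of $u$ values are assumptions, and `lam` is a helper name introduced here.

```python
import math

def lam(u, n):
    """lambda as a function of u: (e/n)^(n/2) * u^(n/2) * exp(-u/2)."""
    return (math.e / n) ** (n / 2) * u ** (n / 2) * math.exp(-u / 2)

n = 4                                           # assumed sample size
grid = [0.5, 1.0, 2.0, 4.0, 8.0, 12.0, 16.0]    # assumed values of u
values = [lam(u, n) for u in grid]
for u, v in zip(grid, values):
    print(u, round(v, 4))
```

The values rise to 1 at $u = n$ and fall on either side, so a statement $\lambda \le \lambda_\alpha$ corresponds to $u$ being small or $u$ being large, which matches the two-tailed picture of Figure 13.8.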


We always reject for small values of $\lambda$, that is, $\Pr\{\lambda \le \lambda_\alpha \mid H_0\} = \alpha$. As shown in Figure 13.8, this statement is equivalent to the general probability statement for $u \le u_0$ and $u \ge u_1$. But in the above case we reject only for large values of $s^2$ compared to $\sigma_0^2$. Hence for the above case we reject only at the right tail. That is, we reject $H_0$ if $\chi^2_n \ge \chi^2_{n,\alpha}$. Similar procedures will yield the test criteria for the other cases. We can summarize the results as follows and the illustration is given in Figure 13.9:


Case (7). $H_0 : \sigma^2 \le \sigma_0^2$ (given), $H_1 : \sigma^2 > \sigma_0^2$, population $N(\mu, \sigma^2)$, $\mu$ known.
Test statistic: $\frac{\sum_{j=1}^n (x_j-\mu)^2}{\sigma_0^2} \sim \chi^2_n$. (Replace the degrees of freedom $n$ by $n - 1$ and $\mu$ by $\bar x$ when $\mu$ is unknown. Hence this situation is not listed separately. Make the changes for each case accordingly. When $\mu$ is known, we have the choice of making use of $\mu$ or ignoring this information.)
Criterion: reject $H_0$ if the observed value of $\frac{\sum_{j=1}^n (x_j-\mu)^2}{\sigma_0^2} \ge \chi^2_{n,\alpha}$.

Case (8). $H_0 : \sigma^2 \ge \sigma_0^2$, $H_1 : \sigma^2 < \sigma_0^2$, population $N(\mu, \sigma^2)$, $\mu$ known.
Test statistic: same as above.
Criterion: reject $H_0$ if the observed $\frac{\sum_{j=1}^n (x_j-\mu)^2}{\sigma_0^2} \le \chi^2_{n,1-\alpha}$.

Case (9). $H_0 : \sigma^2 = \sigma_0^2$, $H_1 : \sigma^2 \ne \sigma_0^2$, population $N(\mu, \sigma^2)$, $\mu$ known.
Test statistic: same as above.
Criterion: reject $H_0$ if the observed $\chi^2_n \le \chi^2_{n,1-\alpha/2}$ or $\chi^2_n \ge \chi^2_{n,\alpha/2}$. (Note from Figure 13.8 that the cut off areas are not equal to $\frac{\alpha}{2}$ at both ends but for convenience we will take them as equal.)


Figure 13.9: Testing hypotheses on $\sigma^2$ in $N(\mu, \sigma^2)$, $\mu$ known.


When $\mu$ is unknown, all steps and criteria are parallel. Replace $\mu$ by $\bar x$ and the degrees of freedom $n$ by $n - 1$ for the chi-square. If $\mu$ is known but we choose to ignore this information, then also replace $\mu$ by $\bar x$.
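Case (7) can be illustrated with a short computation. Everything numerical below is hypothetical (the data, the known $\mu$, $\sigma_0^2$ and $\alpha$ are invented for the illustration), and the tail probability uses the closed form of the chi-square upper tail that holds only for an even number of degrees of freedom, which is why $n = 4$ with $\mu$ known was chosen. The helper name `chi2_upper_tail_even` is introduced here.

```python
import math

def chi2_upper_tail_even(x, df):
    """Pr{chi-square with df degrees of freedom >= x}; valid only for even df."""
    if df <= 0 or df % 2:
        raise ValueError("closed form used here needs a positive even df")
    y = x / 2
    term = total = 1.0
    for k in range(1, df // 2):
        term *= y / k
        total += term
    return math.exp(-y) * total

data = [4.1, 5.3, 6.2, 3.8]        # hypothetical observations
mu = 5.0                           # hypothetical known mean
sigma0_sq = 0.5                    # hypothetical sigma_0^2 in H0: sigma^2 <= sigma_0^2
alpha = 0.05

n = len(data)
u = sum((x - mu) ** 2 for x in data) / sigma0_sq    # test statistic, chi-square with n d.f. at sigma_0^2
p_value = chi2_upper_tail_even(u, n)                # right-tail p-value
reject = p_value <= alpha                           # same as u >= chi-square_{n, alpha}
print(round(u, 2), round(p_value, 3), reject)
```

Rejecting when the right-tail p-value is at most $\alpha$ is the same rule as rejecting when the observed statistic is at least $\chi^2_{n,\alpha}$, so no table lookup is needed in this sketch.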


Exercises 13.3 


13.3.1. By using the likelihood ratio principle, derive the test criteria for testing the following hypotheses on $\mu$ in a $N(\mu, \sigma^2)$ where $\sigma^2$ is known, assuming that a simple random sample of size $n$ is available from this population; here $\mu_0$ refers to a given number.
(1) $H_0 : \mu = \mu_0$, $H_1 : \mu > \mu_0$ (It is known beforehand that $\mu$ can never be less than $\mu_0$.)
(2) $H_0 : \mu \ge \mu_0$, $H_1 : \mu < \mu_0$
(3) $H_0 : \mu = \mu_0$, $H_1 : \mu < \mu_0$ (It is known beforehand that $\mu$ can never be greater than $\mu_0$.)
(4) $H_0 : \mu = \mu_0$, $H_1 : \mu \ne \mu_0$


13.3.2. Repeat Exercise 13.3.1 for the following situations when $\sigma^2$ is unknown:
(1) $H_0 : \mu = \mu_0$, $H_1 : \mu < \mu_0$ (It is known beforehand that $\mu$ can never be greater than $\mu_0$.)


13.3.3. By using the likelihood ratio principle, derive the test criteria for testing the following hypotheses on $\sigma^2$ in a $N(\mu, \sigma^2)$, assuming that a simple random sample of size $n$ is available. Construct the criteria for the cases (a) $\mu$ is known, (b) $\mu$ is unknown for the following situations, where $\sigma_0^2$ denotes a given quantity:
(1) $H_0 : \sigma^2 = \sigma_0^2$, $H_1 : \sigma^2 > \sigma_0^2$ (It is known beforehand that $\sigma^2$ can never be less than $\sigma_0^2$.)
(2) $H_0 : \sigma^2 \ge \sigma_0^2$, $H_1 : \sigma^2 < \sigma_0^2$
(3) $H_0 : \sigma^2 = \sigma_0^2$, $H_1 : \sigma^2 < \sigma_0^2$ (It is known beforehand that $\sigma^2$ can never be greater than $\sigma_0^2$.)
(4) $H_0 : \sigma^2 = \sigma_0^2$, $H_1 : \sigma^2 \ne \sigma_0^2$


13.3.4. Illustrate all cases in Exercise 13.3.1 if $\mu = 1$ kg and the observed data are the yields of tapioca on experimental test plots given by 5, 10, 3, 7 kg.


13.3.5. Illustrate all cases in Exercises 13.3.2 for the same data in Exercise 13.3.4. 


13.3.6. Illustrate all cases in Exercise 13.3.3 by using the data in Exercise 13.3.4 for 
the cases (a) μ = 1, (b) μ is unknown. Cut off equal areas at both tails, for conve- 
nience. 


13.4 Testing hypotheses in bivariate normal population 


In a bivariate normal distribution, there are five parameters. If $(x_1, x_2)$ represents a bivariate random variable, then the parameters are $\mu_1 = E(x_1)$, $\mu_2 = E(x_2)$, $\mathrm{Var}(x_1) = \sigma_1^2$, $\mathrm{Var}(x_2) = \sigma_2^2$, $\rho$ = correlation between $x_1$ and $x_2$. If we have $n$ data points, then the sample will be of the form $X_i = (x_{1i}, x_{2i})$, $i = 1,\ldots,n$, where capital $X_i$ represents a vector. If we have a simple random sample from $(x_1, x_2)$, then $X_i$, $i = 1,\ldots,n$ are iid variables. In a practical situation, it may be possible to take observations but we may not have information about the five parameters. The situation may be that $x_1$ is the weight of an experimental animal before giving a special animal feed and $x_2$ may be the weight of the same animal after administering the special feed. Evidently, $x_1$ and $x_2$ are not independently distributed. Another example is the situation of administering a drug. $x_1$ may be the blood pressure before giving the drug and $x_2$ the same blood pressure after giving the drug. Our aim may be to test hypotheses of the type


$\mu_2 \le \mu_1$ or $\mu_2 - \mu_1 \le 0$. Such hypotheses can be tested without knowing the five parameters.

Let us consider a more general situation of testing a hypothesis on $a\mu_1 + b\mu_2 + c$ where $a, b, c$ are known constants, such as $2\mu_1 - 3\mu_2 + 2 \le 0$. This can be tested because we know that when $(x_1, x_2)$ has a bivariate normal distribution, then all linear functions of $x_1$ and $x_2$ are also normally distributed. Hence if $u = ax_1 + bx_2 + c$, then $u \sim N(\mu, \sigma^2)$, where $\mu = E(u) = a\mu_1 + b\mu_2 + c$ and $\sigma^2 = \mathrm{Var}(u)$. Here, $\sigma^2$ is usually unknown. Hence the procedure is the following. Convert the observations $(x_{1i}, x_{2i})$ to observations on $u$, namely

$$u_i = ax_{1i} + bx_{2i} + c, \quad i = 1,\ldots,n, \quad u \sim N(\mu, \sigma^2),$$

where $a, b, c$ are known. Let

$$\bar u = \frac{1}{n}\sum_{i=1}^n u_i, \qquad s_1^2 = \frac{\sum_{i=1}^n (u_i-\bar u)^2}{n-1}.$$


Then from Section 13.3.2, the test statistic will be a Student-t with $n - 1$ degrees of freedom, given by

$$\frac{\sqrt n(\bar u-\mu_0)}{s_1} \sim t_{n-1}, \tag{13.12}$$

where $\mu_0$ is the hypothesized value of $E(u)$. The test criteria, at the level of rejection $\alpha$, are the following, where $\mu = a\mu_1 + b\mu_2 + c$, with $a, b, c$ known:


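Case (10). $H_0 : \mu \le \mu_0$, $H_1 : \mu > \mu_0$; reject $H_0$ if the observed value of $\frac{\sqrt n(\bar u-\mu_0)}{s_1} \ge t_{n-1,\alpha}$ or $\bar u \ge \mu_0 + t_{n-1,\alpha}\frac{s_1}{\sqrt n}$.

Case (11). $H_0 : \mu \ge \mu_0$, $H_1 : \mu < \mu_0$; reject $H_0$ if the observed value of $\frac{\sqrt n(\bar u-\mu_0)}{s_1} \le -t_{n-1,\alpha}$ or $\bar u \le \mu_0 - t_{n-1,\alpha}\frac{s_1}{\sqrt n}$.

Case (12). $H_0 : \mu = \mu_0$, $H_1 : \mu \ne \mu_0$; reject $H_0$ if the observed value of $\left|\frac{\sqrt n(\bar u-\mu_0)}{s_1}\right| \ge t_{n-1,\alpha/2}$ or $\bar u \ge \mu_0 + t_{n-1,\alpha/2}\frac{s_1}{\sqrt n}$ or $\bar u \le \mu_0 - t_{n-1,\alpha/2}\frac{s_1}{\sqrt n}$.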


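The illustration is the same as the one in Section 13.3.2, Figure 13.7. Let $\sigma^2 = \mathrm{Var}(u)$. Then we can test hypotheses on $\sigma^2$ also by using the same data.

Case (13). $H_0 : \sigma^2 \le \sigma_0^2$, $H_1 : \sigma^2 > \sigma_0^2$; reject $H_0$ if the observed value of $\chi^2_{n-1} = \frac{\sum_{j=1}^n (u_j-\bar u)^2}{\sigma_0^2} \ge \chi^2_{n-1,\alpha}$.

Case (14). $H_0 : \sigma^2 \ge \sigma_0^2$, $H_1 : \sigma^2 < \sigma_0^2$; reject $H_0$ if the observed value of the same $\chi^2_{n-1} \le \chi^2_{n-1,1-\alpha}$.

Case (15). $H_0 : \sigma^2 = \sigma_0^2$, $H_1 : \sigma^2 \ne \sigma_0^2$; reject $H_0$ if the observed value of the same $\chi^2_{n-1} \le \chi^2_{n-1,1-\alpha/2}$ or $\chi^2_{n-1} \ge \chi^2_{n-1,\alpha/2}$.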




Note 13.5. If $(x_1,\ldots,x_p) \sim N_p(\tilde\mu, \Sigma)$, a p-variate multinormal, then the same procedure can be used for testing hypotheses on the expected value and variance of any given linear function $a_1x_1 + \cdots + a_px_p + b$, where $a_1,\ldots,a_p, b$ are known, without knowing the individual mean values or the covariance matrix $\Sigma$.


Example 13.6. The claim of a particular exercise routine is that the weight will be reduced at least by 5 kilograms (kg). A set of 4 people are selected at random from the set of individuals who went through the exercise routine. The weight before starting is $x_1$ and the weight at the finish is $x_2$. The following are the observations on $(x_1, x_2)$: (50, 50), (60, 55), (70, 60), (70, 75). Assuming a bivariate normal distribution for $(x_1, x_2)$ test the claim at a 2.5% level of rejection.


Solution 13.6. Let $\mu_1 = E(x_1)$, $\mu_2 = E(x_2)$. Then the claim is $\mu_1 - \mu_2 \ge 5$. Let $u = x_1 - x_2$. Then the observations on $u$ are $50 - 50 = 0$, $60 - 55 = 5$, $70 - 60 = 10$, $70 - 75 = -5$ and the observed $\bar u = \frac14(0 + 5 + 10 - 5) = 2.5$. An observed value of $s_1^2 = \frac{1}{n-1}\sum_{j=1}^n (u_j-\bar u)^2 = \frac13[(0 - 2.5)^2 + (5 - 2.5)^2 + (10 - 2.5)^2 + (-5 - 2.5)^2] = \frac{125}{3}$, $s_1 = 6.45$, $\mu_0 = 5$. Hence $\mu_0 - t_{n-1,\alpha}\frac{s_1}{\sqrt n} = 5 - t_{3,0.025}\frac{6.45}{2} = 5 - 3.182(3.225) = -5.26$. The observed value of $\bar u = 2.5$, which is not less than $-5.26$, and hence the null hypothesis is not rejected.
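The steps of Solution 13.6 (convert the pairs to observations on $u$, then apply the Student-t criterion of Case (11)) can be sketched as follows. The percentage point 3.182 is the table value quoted in the solution and is hard-coded rather than computed; the variable names are introduced for this illustration.

```python
from math import sqrt
from statistics import mean, stdev

pairs = [(50, 50), (60, 55), (70, 60), (70, 75)]   # (x1, x2) from Example 13.6
a, b, c = 1, -1, 0                 # u = x1 - x2, the weight reduction
mu0 = 5                            # claim: E(u) >= 5, so H0: mu >= 5, H1: mu < 5
t_table = 3.182                    # t_{3,0.025} as quoted in Solution 13.6

u = [a * x1 + b * x2 + c for x1, x2 in pairs]
n = len(u)
u_bar = mean(u)
s1 = stdev(u)                      # divisor n - 1
t_obs = sqrt(n) * (u_bar - mu0) / s1
critical = mu0 - t_table * s1 / sqrt(n)

reject = u_bar <= critical         # equivalently t_obs <= -t_table
print(u, u_bar, round(s1, 3), round(critical, 2), reject)
```

The critical value printed here differs from $-5.26$ only in the second decimal, because the solution rounds $s_1$ to 6.45 before multiplying. Changing `a`, `b`, `c` gives the same test for any other linear function of $(x_1, x_2)$.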


Remark 13.3. Our conclusion should not be interpreted as the claim being correct, and it cannot be interpreted that we can "accept" the claim. The above result only indicates that the data at hand do not enable us to reject the claim. Perhaps other data points might have rejected the hypothesis, or perhaps the normality assumption is not correct, in which case the procedure becomes invalid, or perhaps the claim itself is correct.


Exercises 13.4 


13.4.1. Let $x_1$ be the grade of a student in a topic before subjecting to a special method of coaching and let $x_2$ be the corresponding grade after the coaching. Let $E(x_1) = \mu_1$, $E(x_2) = \mu_2$. The following are the paired observations on the grades of five independently selected students from the same set: (80, 85), (90, 92), (85, 80), (60, 70), (65, 68). Assuming $(x_1,x_2)$ to have a bivariate normal distribution test the following claims at
5% level of rejection: (1) Wy > µη; (2) 24; - 344 < 2; (3) Spy - 24 < 3. 


13.4.2. Let $t_1$ be the body temperature of a patient having some sort of fever before giving particular medicine and let $t_2$ be the temperature after giving the medicine. Let $\mu_1 = E(t_1)$, $\mu_2 = E(t_2)$, $\sigma_1^2 = \mathrm{Var}(t_1 - t_2)$, $\sigma_2^2 = \mathrm{Var}(2t_1 - 3t_2)$. Assume that $(t_1,t_2)$ has a bivariate normal distribution. The following are the observations on $(t_1,t_2)$ from 4 randomly selected patients having the same sort of fever. (101, 98), (100, 97), (100, 100),
(98,97). Test the following hypotheses at 5% level of rejection. (1) Ho : μ; «μι; 2) Ηρ: 
Hy < H +1; (3) Ho : i +1 2 pu; (4) Ho : 02 < 0.04; (5) Ηρ: of = 02; (6) Ηρ: 02 2 03. 




13.5 Testing hypotheses on the parameters of independent 
normal populations 


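Let $x_1 \sim N(\mu_1,\sigma_1^2)$ and $x_2 \sim N(\mu_2,\sigma_2^2)$ and let $x_1$ and $x_2$ be independently distributed. This situation is a special case of the situation in Section 13.4. We can test hypotheses on the mean values and variances of general linear functions of $x_1$ and $x_2$ by using a procedure similar to the one adopted in Section 13.4, that is, hypotheses on $E(u)$ and $\mathrm{Var}(u)$, where $u = a_1x_1 + a_2x_2 + b$, and $a_1, a_2, b$ are known constants. We will list here some standard situations such as $H_0: \mu_1 - \mu_2 = \delta$, for given $\delta$. Let $x_{11},\ldots,x_{1n_1}$ and $x_{21},\ldots,x_{2n_2}$ be simple random samples of sizes $n_1$ and $n_2$ from $x_1$ and $x_2$, respectively. Let

$$\bar{x}_1 = \sum_{j=1}^{n_1}\frac{x_{1j}}{n_1},\quad \bar{x}_2 = \sum_{j=1}^{n_2}\frac{x_{2j}}{n_2},\quad s_1^2 = \sum_{j=1}^{n_1}\frac{(x_{1j}-\bar{x}_1)^2}{n_1},\quad s_2^2 = \sum_{j=1}^{n_2}\frac{(x_{2j}-\bar{x}_2)^2}{n_2},$$

$$s_{1(1)}^2 = \sum_{j=1}^{n_1}\frac{(x_{1j}-\bar{x}_1)^2}{n_1-1},\quad s_{2(1)}^2 = \sum_{j=1}^{n_2}\frac{(x_{2j}-\bar{x}_2)^2}{n_2-1},$$

$$s^2 = \frac{\sum_{j=1}^{n_1}(x_{1j}-\bar{x}_1)^2 + \sum_{j=1}^{n_2}(x_{2j}-\bar{x}_2)^2}{n_1+n_2-2}. \tag{13.13}$$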


The likelihood ratio principle will lead to the following test criteria at a level of rejection $\alpha$:

Case (16). $\sigma_1^2$, $\sigma_2^2$ known, $\delta$ given. $H_0: \mu_1 - \mu_2 \le \delta$, $H_1: \mu_1 - \mu_2 > \delta$; Test statistic: $z = \frac{\bar{x}_1 - \bar{x}_2 - \delta}{\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}} \sim N(0,1)$; Criterion: reject $H_0$ if the observed value of $z \ge z_\alpha$.

Case (17). $\sigma_1^2$, $\sigma_2^2$, $\delta$ known. $H_0: \mu_1 - \mu_2 \ge \delta$, $H_1: \mu_1 - \mu_2 < \delta$. Test statistic is the same $z$ as above. Test criterion: reject $H_0$ if the observed value of $z \le -z_\alpha$.

Case (18). $\sigma_1^2$, $\sigma_2^2$, $\delta$ known. $H_0: \mu_1 - \mu_2 = \delta$, $H_1: \mu_1 - \mu_2 \ne \delta$. Test statistic is the same as above. Test criterion: reject $H_0$ if the observed value of $|z| \ge z_{\frac{\alpha}{2}}$.

Illustration is the same as the ones in Section 13.3.1, Figure 13.5.


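Case (19). $\sigma_1^2 = \sigma_2^2 = \sigma^2$ (unknown), $\delta$ given. $H_0: \mu_1 - \mu_2 \le \delta$, $H_1: \mu_1 - \mu_2 > \delta$. Test statistic

$$\frac{\bar{x}_1 - \bar{x}_2 - \delta}{s\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} \sim t_{n_1+n_2-2}$$

or a Student-t with $n_1 + n_2 - 2$ degrees of freedom, where $s^2$ is given in (13.13). Test criterion: reject $H_0$ if the observed value of $t_{n_1+n_2-2} \ge t_{n_1+n_2-2,\alpha}$.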


Case (20). $\sigma_1^2 = \sigma_2^2 = \sigma^2$ (unknown), $\delta$ given. $H_0: \mu_1 - \mu_2 \ge \delta$, $H_1: \mu_1 - \mu_2 < \delta$. Test statistic is the same Student-t as above. Test criterion: reject $H_0$ if the observed value of $t_{n_1+n_2-2} \le -t_{n_1+n_2-2,\alpha}$.




Case (21). $\sigma_1^2 = \sigma_2^2 = \sigma^2$ (unknown), $\delta$ given. $H_0: \mu_1 - \mu_2 = \delta$, $H_1: \mu_1 - \mu_2 \ne \delta$. Test statistic is the same as above. Test criterion: reject $H_0$ if the observed value of $|t_{n_1+n_2-2}| \ge t_{n_1+n_2-2,\frac{\alpha}{2}}$.


The illustration is the same as in Section 13.3.2, Figure 13.7. We can also test a hypothesis on a constant multiple of the ratio of the variances in this case. Consider a typical hypothesis of the type $H_0: \frac{\sigma_1^2}{\sigma_2^2} \le \eta$, $\eta > 0$, for a given $\eta$. Without loss of generality, we can take $\eta = 1$ and write the hypothesis as $\frac{\sigma_1^2}{\sigma_2^2} \le 1$. Let us examine the likelihood ratio principle here. Let $\theta$ represent all the parameters, $\theta = (\mu_1,\mu_2,\sigma_1^2,\sigma_2^2)$. The parameter space is

$$\theta \in \Omega \;\Rightarrow\; \Omega = \{\theta \mid -\infty < \mu_i < \infty,\ 0 < \sigma_i^2 < \infty,\ i = 1,2\}.$$

The joint density function is given by


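$$L = \left[\frac{1}{(2\pi\sigma_1^2)^{\frac{n_1}{2}}}\, e^{-\frac{1}{2\sigma_1^2}\sum_{j=1}^{n_1}(x_{1j}-\mu_1)^2}\right] \times \left[\frac{1}{(2\pi\sigma_2^2)^{\frac{n_2}{2}}}\, e^{-\frac{1}{2\sigma_2^2}\sum_{j=1}^{n_2}(x_{2j}-\mu_2)^2}\right].$$

In the parameter space $\Omega$, the MLEs are $\bar{x}_1, \bar{x}_2, s_1^2, s_2^2$, which are given in (13.13). Hence the maximum of the likelihood function is the following:

$$\max_{\theta\in\Omega} L = \frac{e^{-\frac{1}{2}(n_1+n_2)}}{(2\pi s_1^2)^{\frac{n_1}{2}}(2\pi s_2^2)^{\frac{n_2}{2}}}. \tag{13.14}$$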
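Now, let us impose the hypothesis $H_0: \sigma_1^2 \le \sigma_2^2$. The MLEs of $\mu_1$ and $\mu_2$ remain the same as $\hat{\mu}_1 = \bar{x}_1$ and $\hat{\mu}_2 = \bar{x}_2$. We can also write this $H_0$ as $\sigma_1^2 = \delta\sigma_2^2$, $0 < \delta \le 1$. Then the joint density, at $\hat{\mu}_1 = \bar{x}_1$, $\hat{\mu}_2 = \bar{x}_2$, denoted by $L_1$, becomes

$$L_1 = \left[(2\pi)^{\frac{n_1+n_2}{2}}\,\delta^{\frac{n_1}{2}}\,(\sigma_2^2)^{\frac{n_1+n_2}{2}}\right]^{-1} \exp\left\{-\frac{1}{2\sigma_2^2}\left[\frac{n_1 s_1^2}{\delta} + n_2 s_2^2\right]\right\}$$

where $s_1^2$ and $s_2^2$ are as defined in (13.13). Maximizing $L_1$ by using calculus we have the estimator for $\sigma_2^2$ as

$$\hat{\sigma}_2^2 = \frac{[n_1 s_1^2/\delta] + n_2 s_2^2}{n_1+n_2}.$$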
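Therefore, substituting all these values we have the $\lambda$-criterion as the following. [In this representation of $\lambda$, we have used the property that

$$\frac{n_1(n_2-1)s_1^2}{n_2(n_1-1)s_2^2} = \frac{\left[\sum_{j=1}^{n_1}(x_{1j}-\bar{x}_1)^2/(n_1-1)\right]}{\left[\sum_{j=1}^{n_2}(x_{2j}-\bar{x}_2)^2/(n_2-1)\right]} \sim F_{n_1-1,n_2-1} \tag{13.15}$$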




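when $\sigma_1^2 = \sigma_2^2$, where $F_{n_1-1,n_2-1}$ is an F-random variable with $n_1 - 1$ and $n_2 - 1$ degrees of freedom. Then $\lambda$ is written in terms of the F-variable, for convenience.]

$$\lambda = \frac{(s_1^2)^{\frac{n_1}{2}}(s_2^2)^{\frac{n_2}{2}}}{\delta^{\frac{n_1}{2}}(\hat{\sigma}_2^2)^{\frac{n_1+n_2}{2}}} = c_1\,\frac{[F_{n_1-1,n_2-1}]^{\frac{n_1}{2}}}{[1 + c_2 F_{n_1-1,n_2-1}]^{\frac{n_1+n_2}{2}}}$$

where $c_1$ and $c_2$ are positive constants. The nature of $\lambda$, as a function of $F_{n_1-1,n_2-1}$, is given in Figure 13.10.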


Figure 13.10: $\lambda$ as a function of $F_{n_1-1,n_2-1}$.


Note that for the above $H_0$ and alternate we will not reject for small values of $s_1^2/s_2^2$; we will reject only for large values, that is, only for large values of $F_{n_1-1,n_2-1}$. Hence the criteria can be stated as the following and the illustration is given in Figure 13.11.


Figure 13.11: Illustration of the critical regions.


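Case (22). Independent $N(\mu_1,\sigma_1^2)$ and $N(\mu_2,\sigma_2^2)$ populations. $H_0: \frac{\sigma_1^2}{\sigma_2^2} \le 1$, $H_1: \frac{\sigma_1^2}{\sigma_2^2} > 1$. Test statistic: $F_{n_1-1,n_2-1}$ given in (13.15). Test criterion: reject $H_0$ if the observed value of $F_{n_1-1,n_2-1} \ge F_{n_1-1,n_2-1,\alpha}$.

Case (23). Same independent normal populations as above. $H_0: \frac{\sigma_1^2}{\sigma_2^2} \ge 1$, $H_1: \frac{\sigma_1^2}{\sigma_2^2} < 1$. Test statistic is the same as above. Test criterion: reject $H_0$ if the observed value of $F_{n_1-1,n_2-1} \le F_{n_1-1,n_2-1,1-\alpha}$.

Case (24). Same independent normal populations as above. $H_0: \sigma_1^2 = \sigma_2^2$, $H_1: \sigma_1^2 \ne \sigma_2^2$. Test statistic is the same as above. Criterion: reject $H_0$ if the observed $F_{n_1-1,n_2-1} \le F_{n_1-1,n_2-1,1-\frac{\alpha}{2}}$ or $F_{n_1-1,n_2-1} \ge F_{n_1-1,n_2-1,\frac{\alpha}{2}}$.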




Note 13.6. The student will not find the lower percentage points of an F-density tabulated. This is because the lower percentage points can be obtained from the upper percentage points of another F-density. The connection is the following:

$$F_{m,n,1-\alpha} = \frac{1}{F_{n,m,\alpha}}$$

and hence the lower percentage points in $F_{m,n}$ are available from the reciprocal of the upper percentage points of $F_{n,m}$.
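As a worked instance of this relation, take the upper 5% point $F_{3,2,0.05} = 19.2$ read from an F-table; the degrees of freedom here are chosen only for illustration:

```latex
F_{2,3,\,0.95} \;=\; \frac{1}{F_{3,2,\,0.05}} \;=\; \frac{1}{19.2} \;\approx\; 0.052
```

Note that the two degrees of freedom are interchanged when passing from the lower point to the upper point.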


Note 13.7. If $\mu_1$ and $\mu_2$ are known, then we can use $\mu_1$ and $\mu_2$ instead of their estimators. In this case, replace $F_{n_1-1,n_2-1}$ by $F_{n_1,n_2}$. In this situation, if we replace $\mu_1$ and $\mu_2$ by their estimators and proceed as above, then also the procedure is valid. Hence in this situation we can use both $F_{n_1,n_2}$ as well as $F_{n_1-1,n_2-1}$ as test statistics.


Example 13.7. Steel rods made through two different processes are tested for breaking strengths. Let the breaking strengths under the two processes be denoted by $x$ and $y$, respectively. The following are the observations on $x$: 5, 10, 12, 3, and on $y$: 8, 15, 10, respectively. Assuming that $x \sim N(\mu_1,\sigma_1^2)$ and $y \sim N(\mu_2,\sigma_2^2)$ and independently distributed, test the hypothesis at a 5% level of rejection that $\sigma_1^2 \le \sigma_2^2$.


Solution 13.7. Here, according to our notations, $n_1 = 4$, $n_2 = 3$, the observed values of $\bar{x} = \frac{1}{4}(5+10+12+3) = 7.5 = \bar{x}_1$, $\bar{y} = \frac{1}{3}(8+15+10) = 11 = \bar{x}_2$, $\sum_{j}(x_{1j}-\bar{x}_1)^2 = (5-7.5)^2 + (10-7.5)^2 + (12-7.5)^2 + (3-7.5)^2 = 53$, $\sum_{j}(x_{2j}-\bar{x}_2)^2 = (8-11)^2 + (15-11)^2 + (10-11)^2 = 26$. Therefore, the observed value of

$$F_{n_1-1,n_2-1} = F_{3,2} = \frac{(53/3)}{(26/2)} \approx 1.36.$$

From an F-table, we have $F_{3,2,0.05} = 19.2$. But 1.36 is not bigger than 19.2, and hence we cannot reject $H_0$.
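The sums of squares and the variance ratio in this solution can be reproduced with a few lines of code. This is a sketch with our own helper name `sum_sq_dev`; the critical value 19.2 is copied from an F-table, not computed. Evaluated at full precision the ratio comes out at roughly 1.36.

```python
def sum_sq_dev(values):
    """Sum of squared deviations from the sample mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

x = [5, 10, 12, 3]     # breaking strengths, process 1
y = [8, 15, 10]        # breaking strengths, process 2

ss_x = sum_sq_dev(x)
ss_y = sum_sq_dev(y)

# F statistic of (13.15): ratio of the unbiased variance estimates.
f_obs = (ss_x / (len(x) - 1)) / (ss_y / (len(y) - 1))

f_crit = 19.2          # F_{3,2,0.05} from a table (assumption)
reject = f_obs >= f_crit

print(ss_x, ss_y, round(f_obs, 2), reject)
```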


Exercises 13.5 


13.5.1. The yields of ginger from test plots, under two different planting schemes, are the following: 20, 25, 22, 27 (scheme 1), 23, 18, 28, 30, 32 (scheme 2). If the yields under the two schemes are independently normally distributed as $N(\mu_1,\sigma_1^2)$ and $N(\mu_2,\sigma_2^2)$, respectively, test the following hypotheses at a 5% level of rejection, where the alternates are the natural alternates or negation of the null hypotheses: (1) $H_0: \mu_1 - \mu_2 \le 1$, given that $\sigma_1^2 = 2$, $\sigma_2^2 = 3$; (2) $H_0: 2\mu_1 - 3\mu_2 \ge 3$ given that $\sigma_1^2 = \sigma_2^2$; (3) $H_0: \mu_1 = \mu_2$, given that $\sigma_1^2 = \sigma_2^2$; (4) $H_0: \sigma_1^2 \le \sigma_2^2$ given that $\mu_1 = 0$, $\mu_2 = -2$; (5) $H_0: \sigma_1^2 = \sigma_2^2$.


13.5.2. Derive the $\lambda$-criterion for the following cases and show that they agree with what is given in the text above, the alternates are the negation of the null hypotheses:




(1) $H_0: \mu_1 \ge \mu_2$, given $\sigma_1^2 = 5$, $\sigma_2^2 = 2$; (2) $H_0: 2\mu_1 - \mu_2 \le 2$, given that $\sigma_1^2 = \sigma_2^2$; (3) $H_0: \sigma_1^2 = \sigma_2^2$, given that $\mu_1 = 0$, $\mu_2 = -5$.


13.6 Approximations when the populations are normal 


In Section 13.3.2, we dealt with the problem of testing hypotheses on the mean value $\mu$ of a normal population when the population variance $\sigma^2$ is unknown. We ended up with a Student-t as the test statistic and the decisions were based on a $t_{n-1}$, Student-t with $n - 1$ degrees of freedom.


13.6.1 Student-t approximation to normal 


In many books, the student may find a statement that if the degrees of freedom are bigger than 30, then the percentage points can be read off from a standard normal table instead of the Student-t table; Student-t tables are also not usually available for degrees of freedom beyond 30. But the student can program a computer to compute the tail areas and compare them with the tail areas of a standard normal density. You will see that even for degrees of freedom as large as 120 the agreement is not that good. The theoretical basis is given in Exercise 13.6.1 at the end of this section. This point should be kept in mind when replacing Student-t values with standard normal values. The approximation used is the following:

$$u = \frac{\sqrt{n}(\bar{x}-\mu_0)}{s} \approx N(0,1), \quad \text{for large } n \tag{13.16}$$

where $s^2$ could be the unbiased estimator for $\sigma^2$ or we may use the MLE itself. When $n$ is large, dividing by $n$ or $n - 1$ will not make much of a difference. Hence the test criteria can be stated as follows:


Case (25). Population $N(\mu,\sigma^2)$, $\sigma^2$ unknown, $n$ is very large. $H_0: \mu \le \mu_0$ (given), $H_1: \mu > \mu_0$. Test statistic as given in (13.16). Criterion: reject $H_0$ if the observed $u \ge z_\alpha$, where $z_\alpha$ is the upper $100\alpha\%$ point from a standard normal.

Case (26). Population and situation as in Case (25). $H_0: \mu \ge \mu_0$, $H_1: \mu < \mu_0$. Test statistic: same $u$ as in (13.16). Criterion: reject $H_0$ if the observed $u \le -z_\alpha$.

Case (27). Population details are as in Case (25). $H_0: \mu = \mu_0$, $H_1: \mu \ne \mu_0$. Test statistic: same $u$ in (13.16). Criterion: reject $H_0$ when the observed value of $|u| \ge z_{\frac{\alpha}{2}}$. Illustration is as in Section 13.3.1, Figure 13.5.
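The comparison of Student-t and standard normal tail areas suggested in this subsection can be programmed directly. The sketch below is one possible way of doing it, not a prescribed method: the function names are ours, the Student-t tail is obtained by Simpson's rule on the density, and the cut-off 1.96 is an arbitrary illustrative choice.

```python
import math

def t_density(t, nu):
    """Density of a Student-t variable with nu degrees of freedom."""
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi))
    return math.exp(log_c - (nu + 1) / 2 * math.log1p(t * t / nu))

def t_upper_tail(x, nu, steps=2000):
    """Pr{t_nu >= x} for x >= 0: Simpson's rule on [0, x], then symmetry."""
    h = x / steps                       # steps must be even for Simpson's rule
    total = t_density(0.0, nu) + t_density(x, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, nu)
    return 0.5 - total * h / 3

def normal_upper_tail(x):
    """Pr{z >= x} for a standard normal z."""
    return 0.5 * math.erfc(x / math.sqrt(2))

for nu in (10, 30, 120):
    print(nu, round(t_upper_tail(1.96, nu), 4), round(normal_upper_tail(1.96), 4))
```

Printing the two columns side by side for increasing degrees of freedom shows how slowly the Student-t tail area approaches the normal one.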


In the case of two independent normal populations $x_1 \sim N(\mu_1,\sigma_1^2)$ and $x_2 \sim N(\mu_2,\sigma_2^2)$, with samples of sizes $n_1$ and $n_2$ and the notations as used in (13.12), we may have a situation where $\sigma_1^2$ and $\sigma_2^2$ are unknown. Then replace the variances by the MLEs $s_1^2$ and $s_2^2$ and consider the following statistic:




$$v = \frac{\bar{x}_1 - \bar{x}_2 - \delta}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}} \approx N(0,1) \tag{13.17}$$

when $n_1$ is large and $n_2$ is also large. Again, the approximation is poor and as a crude rule, we may use the approximation in (13.17) when $n_1 \ge 30$, $n_2 \ge 30$. In the following cases, we list only the types of hypotheses on $\mu_1 - \mu_2$ for convenience, but remember that similar hypotheses on linear functions of $\mu_1$ and $\mu_2$ can be formulated and tested.


Case (28). Independent normal populations $N(\mu_1,\sigma_1^2)$, $N(\mu_2,\sigma_2^2)$, samples of sizes $n_1 \ge 30$ and $n_2 \ge 30$, $\sigma_1^2$, $\sigma_2^2$ unknown. $H_0: \mu_1 - \mu_2 \le \delta$ (given), $H_1: \mu_1 - \mu_2 > \delta$. Test statistic: $v$ in (13.17). Criterion: reject $H_0$ if the observed value of $v \ge z_\alpha$.

Case (29). Same situation as in Case (28). $H_0: \mu_1 - \mu_2 \ge \delta$ (given), $H_1: \mu_1 - \mu_2 < \delta$. Test statistic: $v$ as in (13.17). Criterion: reject $H_0$ if the observed value of $v \le -z_\alpha$.

Case (30). Same situation as in Case (28). $H_0: \mu_1 - \mu_2 = \delta$ (given), $H_1: \mu_1 - \mu_2 \ne \delta$. Test statistic: same $v$ as in (13.17). Criterion: reject $H_0$ if the observed value of $|v| \ge z_{\frac{\alpha}{2}}$. Illustrations as in Section 13.3.1, Figure 13.5.
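A minimal sketch of the one-sided large-sample criterion of Case (28) follows. The summary figures are hypothetical, invented only to exercise the formula, and the function name `v_statistic` is ours; the standard normal percentage point is taken from Python's `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

def v_statistic(xbar1, xbar2, s1_sq, s2_sq, n1, n2, delta):
    """Large-sample statistic of (13.17), with s1_sq and s2_sq the variance estimates."""
    return (xbar1 - xbar2 - delta) / math.sqrt(s1_sq / n1 + s2_sq / n2)

# Hypothetical summary data (not from the text).
v = v_statistic(xbar1=52.0, xbar2=50.0, s1_sq=16.0, s2_sq=25.0,
                n1=40, n2=50, delta=0.0)

alpha = 0.05
z_alpha = NormalDist().inv_cdf(1 - alpha)   # upper 100*alpha% point of N(0,1)

# Case (28): reject H0: mu1 - mu2 <= delta when v >= z_alpha.
reject = v >= z_alpha
print(round(v, 3), round(z_alpha, 3), reject)
```

For the two-sided Case (30) the same statistic would be compared in absolute value with `NormalDist().inv_cdf(1 - alpha / 2)`.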


In the case of independent normal populations, we may have a situation $\sigma_1^2 = \sigma_2^2 = \sigma^2$ where the common $\sigma^2$ may be unknown. Then we have seen that we can replace this $\sigma^2$ by a combined unbiased estimator as given in (13.13) and we will end up with a Student-t with $n_1 + n_2 - 2$ degrees of freedom. If this $n_1 + n_2 - 2$ is really large, then we can have a standard normal approximation as follows:

$$w = \frac{\bar{x}_1 - \bar{x}_2 - \delta}{s\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} \approx N(0,1) \tag{13.18}$$

where for the approximation to hold $n_1 + n_2 - 2$ has to be really large, as explained above. Then we can have the following test criteria:


Case (31). Independent normal populations $N(\mu_1,\sigma_1^2)$, $N(\mu_2,\sigma_2^2)$, $\sigma_1^2 = \sigma_2^2 = \sigma^2$ (unknown), $n_1 + n_2 - 2$ is very large. $H_0: \mu_1 - \mu_2 \le \delta$ (given), $H_1: \mu_1 - \mu_2 > \delta$. Test statistic: $w$ in (13.18). Criterion: reject $H_0$ if the observed value of $w \ge z_\alpha$.

Case (32). Situation as in Case (31). $H_0: \mu_1 - \mu_2 \ge \delta$ (given), $H_1: \mu_1 - \mu_2 < \delta$. Test statistic: $w$ of (13.18). Criterion: reject $H_0$ if the observed value of $w \le -z_\alpha$.

Case (33). Situation as in Case (31). $H_0: \mu_1 - \mu_2 = \delta$ (given), $H_1: \mu_1 - \mu_2 \ne \delta$. Test statistic: $w$ as in (13.18). Criterion: reject $H_0$ if the observed value of $|w| \ge z_{\frac{\alpha}{2}}$; see the illustrations as in Section 13.3.1, Figure 13.5.


13.6.2 Approximations based on the central limit theorem 


The central limit theorem says that whatever be the population, designated by the random variable $x$, as long as the variance $\sigma^2$ of $x$ is finite, then for a simple random sample $x_1,\ldots,x_n$, the standardized sample mean goes to standard normal when $n \to \infty$.




$$u = \frac{\sqrt{n}(\bar{x}-\mu_0)}{\sigma} \to N(0,1), \quad \text{as } n \to \infty \tag{13.19}$$

where $\mu_0$ is a given value of the mean value of $x$. Hence when $\sigma^2$ is known, we can test hypotheses on $\mu$ if we have very large $n$. The approximate tests are the following:


Case (34). Any population with known $\sigma^2 < \infty$, $n$ very large. $H_0: \mu \le \mu_0$ (given), $H_1: \mu > \mu_0$. Test statistic: $u$ in (13.19). Criterion: reject $H_0$ if the observed value of $u \ge z_\alpha$.

Case (35). Any population with known finite variance, $n$ very large. $H_0: \mu \ge \mu_0$ (given), $H_1: \mu < \mu_0$. Test statistic: $u$ in (13.19). Criterion: reject $H_0$ if the observed value of $u \le -z_\alpha$.

Case (36). Any population with known finite variance, $n$ very large. $H_0: \mu = \mu_0$ (given), $H_1: \mu \ne \mu_0$. Test statistic: $u$ in (13.19). Criterion: reject $H_0$ if the observed value of $|u| \ge z_{\frac{\alpha}{2}}$; see the illustrations as in Section 13.3.1, Figure 13.5.


In the approximation, $\sigma$ must be known. If $\sigma$ is not known and if $\sigma^2$ is replaced by an unbiased estimator for $\sigma^2$, then we do not have a Student-t approximation even for large $n$. Hence the students are advised not to try to test hypotheses on $\mu$ by estimating $\sigma^2$ when the population is unknown.


Exercises 13.6 


13.6.1. Take a Student-t density with $\nu$ degrees of freedom. Take the limit when $\nu \to \infty$ and show that the Student-t density goes to a standard normal density when $\nu \to \infty$. [Hint: Use Stirling's formula on $\Gamma(\frac{\nu}{2})$ and $\Gamma(\frac{\nu+1}{2})$. Stirling's formula says that when $|z| \to \infty$ and $\alpha$ is a bounded quantity, then

$$\Gamma(z+\alpha) \approx \sqrt{2\pi}\, z^{z+\alpha-\frac{1}{2}}\, e^{-z}.$$

Take $z = \frac{\nu}{2}$ when using Stirling's formula on the Student-t density. The process $(1+\frac{1}{n})^n \to e$ is very slow, and hence no good approximation can come for small values of $\nu$. Even for $\nu = 120$, the approximation is not that close. Hence there should be caution when replacing a Student-t variable with a standard normal variable.]


13.6.2. For a Student-t with v degrees of freedom, show that the h-th integer moment 
can exist only when v > h, and when v > h then all odd moments for h < v will be 
zeroes. 


13.6.3. Two different varieties of sunflower plants are planted on 50 test plots under variety 1 and 60 test plots under variety 2. The summary data, in our notations in the text above (see equation (13.13)), are the following: $\bar{x}_1 = 20$, $\bar{x}_2 = 25$, $s_1^2 = 4$, $s_2^2 = 9$. Test the following hypotheses, against the natural alternates, at a 5% level of rejection, assuming that $x$ and $y$, typical yields under variety 1 and variety 2, respectively, are independently normally distributed: (1) $H_0: \mu_1 - \mu_2 \ge 3$, (2) $H_0: \mu_2 - \mu_1 = 2$ but it is known that $\sigma_1^2 = \sigma_2^2$ (unknown).




13.7 Testing hypotheses in binomial, Poisson and exponential 
populations 


In our binomial population, there are two parameters $(n,p)$, $0 < p < 1$, and in our Poisson population there is one parameter $\lambda$. We will consider the problems of testing hypotheses on $p$ when $n$ is known, which is also the same as testing hypotheses on $p$ of a point-Bernoulli population, and testing hypotheses on the Poisson parameter $\lambda$.


13.7.1 Hypotheses on p, the Bernoulli parameter 


Consider $n$ iid variables from a Bernoulli population. Then the joint probability function, denoted by $L$, is the following:

$$L = p^{\sum_{j=1}^{n}x_j}(1-p)^{n-\sum_{j=1}^{n}x_j} = p^{x}(1-p)^{n-x},$$

where $x$ is a binomial random variable. The MLE of $p$ in the whole parameter space is $\hat{p} = \frac{x}{n}$. Under $H_0: 0 < p \le p_0$ (given), $\lambda = 1$ if $\frac{x}{n} \le p_0$. Hence we reject $H_0$ only when $\frac{x}{n} > p_0$ or when $x$ is large. The likelihood equation is

$$\frac{\partial}{\partial p}\ln L = 0 \;\Rightarrow\; p - \frac{x}{n} = 0.$$

If $\frac{x}{n}$ is an admissible value for $p$, then $\hat{p} = \frac{x}{n}$. Otherwise assign the closest value to $\frac{x}{n}$ for $p$, that is, the MLE under $H_0$ is $p_0$ for both $H_0: p \le p_0$ (given) and $H_0: p \ge p_0$. But $x$ has a binomial distribution. How large should $x$ be for rejecting $H_0: p \le p_0$? Here, $x \ge x_0$ such that

$$\Pr\{x \ge x_0 \mid p = p_0\} \le \alpha \quad\text{or}\quad \sum_{x=x_0}^{n}\binom{n}{x}p_0^{x}(1-p_0)^{n-x} \le \alpha. \tag{13.20}$$

Look into binomial tables for $p = p_0$ and compute $x_0$ of (13.20). If the sum of the binomial probabilities does not hit $\alpha$ exactly, then take the nearest $x_0$ such that the probability is $\le \alpha$. For example, from the binomial tables, for $n = 10$, $p_0 = 0.40$, $\Pr\{x \ge 8 \mid p = 0.4\} = 0.0123$, $\Pr\{x \ge 7 \mid p = 0.4\} = 0.0548$. Hence if $\alpha = 0.05$ then we take $x_0 = 8$. Hence the criteria can be stated as follows:


Case (37). Bernoulli parameter $p$. $H_0: p \le p_0$ (given), $H_1: p > p_0$. Test statistic: binomial variable $x$. Criterion: reject $H_0$ if the observed number of successes is $\ge x_0$, where $\sum_{x=x_0}^{n}\binom{n}{x}p_0^{x}(1-p_0)^{n-x} \le \alpha$.

Case (38). Bernoulli parameter $p$. $H_0: p \ge p_0$ (given), $H_1: p < p_0$. Test statistic: binomial variable $x$. Criterion: reject $H_0$ if the observed number of successes is $\le x_1$, such that $\sum_{x=0}^{x_1}\binom{n}{x}p_0^{x}(1-p_0)^{n-x} \le \alpha$.




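Case (39). Bernoulli parameter $p$. $H_0: p = p_0$ (given), $H_1: p \ne p_0$. Test statistic: binomial variable $x$. Criterion: reject $H_0$ if the observed value of $x \le x_2$ or $x \ge x_3$, where

$$\sum_{x=0}^{x_2}\binom{n}{x}p_0^{x}(1-p_0)^{n-x} \le \frac{\alpha}{2}, \qquad \sum_{x=x_3}^{n}\binom{n}{x}p_0^{x}(1-p_0)^{n-x} \le \frac{\alpha}{2}$$

as illustrated in Figure 13.12.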
Figure 13.12: Hypotheses $H_0: p \le p_0$, $H_0: p \ge p_0$, $H_0: p = p_0$.


Example 13.8. Someone is shooting at a target. Let $p$ be the true probability of a hit. Assume that $p$ remains the same from trial to trial and that the trials are independent. Out of 10 trials, she has 4 hits. Test the hypothesis $H_0: p \le 0.45$, $H_1: p > 0.45$ at the level of rejection $\alpha = 0.05$.


Solution 13.8. The number of hits $x$ is a binomial variable with parameters $(p, n = 10)$. We reject $H_0$ if the observed number of hits is $\ge x_0$, such that

$$\Pr\{x \ge x_0 \mid p = 0.45\} \le 0.05 \quad\text{or}\quad \sum_{x=x_0}^{10}\binom{10}{x}(0.45)^{x}(0.55)^{10-x} \le 0.05.$$

From a binomial table for $n = 10$ and $p = 0.45$, we have $\Pr\{x \ge 8 \mid p = 0.45\} = 0.0274$ and $\Pr\{x \ge 7 \mid p = 0.45\} = 0.102$. Hence $x_0 = 8$ at $\alpha = 0.05$. But our observed $x$ is 4, which is not $\ge 8$, and hence $H_0$ is not rejected.
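Instead of reading a binomial table, the cut-off $x_0$ of (13.20) can be found by summing the binomial probabilities directly. The following is a small sketch; the function names are ours and are not part of any library.

```python
from math import comb

def binom_upper_tail(n, p, x0):
    """Pr{x >= x0} for a binomial(n, p) variable x."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(x0, n + 1))

def upper_critical_value(n, p0, alpha):
    """Smallest x0 with Pr{x >= x0 | p = p0} <= alpha; n + 1 if no such x0 in 0..n."""
    for x0 in range(n + 1):
        if binom_upper_tail(n, p0, x0) <= alpha:
            return x0
    return n + 1

x0 = upper_critical_value(10, 0.45, 0.05)
observed = 4
print(x0, round(binom_upper_tail(10, 0.45, x0), 4), observed >= x0)
```

The upper tail probability decreases as $x_0$ grows, so the first value meeting the bound is the smallest admissible cut-off.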


Note 13.8. Note that the procedure in testing hypotheses on the binomial parameter $p$ is different from the procedure of establishing confidence intervals for $p$. In some of the tests on the parameters of the normal populations, it may be noted that the non-rejection intervals in hypotheses testing agree with some confidence intervals. Because of this, some authors mix up the process of constructing confidence intervals and tests of hypotheses and give credence to the statement "acceptance of $H_0$". The logical fallacy of such misinterpretations of tests of hypotheses has already been pointed out. Besides, tests are constructed by using some principles, such as the likelihood ratio principle, but confidence intervals can be constructed by picking any suitable pivotal quantity. Comments similar to the ones here also hold for tests of hypotheses and construction of confidence intervals in the Poisson case, which will be discussed next.




13.7.2 Hypotheses on a Poisson parameter 


The Poisson probability function is $f(x) = \frac{\theta^{x}}{x!}e^{-\theta}$, $\theta > 0$, with the support $x = 0,1,2,\ldots$. Let $x_1,\ldots,x_n$ be iid variables from this Poisson population. [Usually the Poisson parameter is denoted by $\lambda$, but in order to avoid confusion with the $\lambda$ of the $\lambda$-criterion, we will use $\theta$ here for the Poisson parameter.] The joint probability function of $x_1,\ldots,x_n$ is

$$L = \frac{\theta^{x_1+\cdots+x_n}\,e^{-n\theta}}{x_1!\cdots x_n!}$$

and the MLE of $\theta$ in the whole parameter space is $\hat{\theta} = \frac{1}{n}(x_1+\cdots+x_n) = \bar{x}$. Consider the hypothesis: $H_0: \theta \le \theta_0$ (given), $H_1: \theta > \theta_0$. If $\bar{x}$ falls in the interval $0 < \bar{x} \le \theta_0$, then the MLE in the parameter space $\Omega = \{\theta \mid 0 < \theta < \infty\}$ as well as under $H_0$ coincide and the likelihood ratio criterion $\lambda = 1$, and we do not reject $H_0$. Hence we reject $H_0$ only when $\bar{x} > \theta_0$. The likelihood equation is

$$\frac{\partial}{\partial\theta}\ln L = 0 \;\Rightarrow\; \theta - \bar{x} = 0.$$

Hence if $\bar{x}$ is an admissible value for $\theta$, then take $\hat{\theta} = \bar{x}$, otherwise assign the closest possible value to $\bar{x}$ as the estimate for $\theta$. Hence the MLE, under $H_0: \theta \le \theta_0$ (given) as well as for $H_0: \theta \ge \theta_0$, is $\theta_0$. Hence the test criterion will be the following: Reject $H_0$ if the observed value of $\bar{x}$ is large or the observed value of $u = x_1+\cdots+x_n$ is large. But $u$ is Poisson distributed with parameter $n\theta$. Therefore, we reject $H_0: \theta \le \theta_0$ if $u \ge u_0$ such that $\Pr\{u \ge u_0 \mid \theta = \theta_0\} \le \alpha$ or

$$\sum_{u=u_0}^{\infty}\frac{(n\theta_0)^{u}}{u!}e^{-n\theta_0} \le \alpha \quad\text{or}\quad \sum_{u=0}^{u_0-1}\frac{(n\theta_0)^{u}}{u!}e^{-n\theta_0} \ge 1-\alpha$$

where

$$u = (x_1+\cdots+x_n) \sim \text{Poisson}(n\theta_0). \tag{13.21}$$


We can summarize the results as follows: 


Case (40). Parameter $\theta$ in a Poisson distribution. $H_0: \theta \le \theta_0$ (given), $H_1: \theta > \theta_0$. Test statistic: $u$ in (13.21). Criterion: reject $H_0$ if the observed value of $u \ge u_0$ such that $\Pr\{u \ge u_0 \mid \theta = \theta_0\} \le \alpha$.

Case (41). Poisson parameter $\theta$. $H_0: \theta \ge \theta_0$ (given), $H_1: \theta < \theta_0$. Test statistic: $u$ in (13.21). Criterion: reject $H_0$ if the observed value of $u \le u_1$ such that $\Pr\{u \le u_1 \mid \theta = \theta_0\} \le \alpha$.


Case (42). Poisson parameter $\theta$. $H_0: \theta = \theta_0$ (given), $H_1: \theta \ne \theta_0$. Test statistic: $u$ in (13.21). Criterion: reject $H_0$ if the observed value of $u \le u_2$ or $\ge u_3$, such that $\Pr\{u \le u_2 \mid \theta = \theta_0\} \le \frac{\alpha}{2}$ and $\Pr\{u \ge u_3 \mid \theta = \theta_0\} \le \frac{\alpha}{2}$. (Equal cut off areas at both ends are taken for convenience only.) For illustration see Figure 13.13.

Figure 13.13: Hypotheses in a Poisson population. Panels: $H_0: \theta \le \theta_0$, $H_0: \theta \ge \theta_0$, $H_0: \theta = \theta_0$.


Example 13.9. The number of snake bites per year in a certain village is found to be Poisson distributed with expected number of bites $\theta$. A randomly selected 3 years gave the number of bites as 0, 8, 4. Test the hypothesis, at 5% level of rejection, that $\theta \le 2$.


Solution 13.9. According to our notation, the observed value of $u = (0+8+4) = 12$, $\theta_0 = 2$, $n = 3$, $n\theta_0 = 6$, $\alpha = 0.05$. We need to compute $u_0$ such that

$$\sum_{u=u_0}^{\infty}\frac{6^{u}}{u!}e^{-6} \le 0.05 \;\Rightarrow\; \sum_{u=0}^{u_0-1}\frac{6^{u}}{u!}e^{-6} \ge 0.95.$$

From the Poisson table, we have $u_0 - 1 = 10$ or $u_0 = 11$. Our observed $u$ is $12 > 11$, and hence we reject $H_0$.
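The Poisson table lookup can be replaced by a direct summation of Poisson probabilities. The sketch below uses our own function names and finds the smallest $u_0$ whose upper tail probability under $n\theta_0 = 6$ does not exceed $\alpha$.

```python
import math

def poisson_cdf(k, mean):
    """Pr{u <= k} for u ~ Poisson(mean), k a non-negative integer."""
    term = math.exp(-mean)      # probability of u = 0
    total = term
    for j in range(1, k + 1):
        term *= mean / j        # recurrence: p(j) = p(j - 1) * mean / j
        total += term
    return total

def poisson_upper_critical(mean, alpha):
    """Smallest u0 with Pr{u >= u0} <= alpha."""
    u0 = 0
    while 1.0 - (poisson_cdf(u0 - 1, mean) if u0 > 0 else 0.0) > alpha:
        u0 += 1
    return u0

u0 = poisson_upper_critical(6.0, 0.05)
observed_u = 0 + 8 + 4
print(u0, observed_u >= u0)
```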


13.7.3 Hypotheses in an exponential population 
The exponential population is given by the density

$$f(x) = \frac{1}{\theta}e^{-\frac{x}{\theta}}, \quad x \ge 0,\ \theta > 0$$

and zero elsewhere. Consider $x_1,\ldots,x_n$ iid variables from this population. Then the joint density function of $x_1,\ldots,x_n$ is

$$L = \frac{1}{\theta^{n}}e^{-\frac{1}{\theta}(x_1+\cdots+x_n)}.$$

The likelihood equation is given by

$$\frac{\partial}{\partial\theta}\ln L = 0 \;\Rightarrow\; \theta - \bar{x} = 0.$$

Hence if $\bar{x}$ is an admissible value for $\theta$ then $\hat{\theta} = \bar{x}$, otherwise assign for $\theta$ the admissible closest value to $\bar{x}$. Therefore, the MLE in the whole parameter space is $\hat{\theta} = \bar{x}$. But for $H_0: \theta \le \theta_0$ (given) as well as for $H_0: \theta \ge \theta_0$ the MLE, under $H_0$, is $\theta_0$. In $H_0: \theta \le \theta_0$ we reject only for $\bar{x} > \theta_0$ (otherwise $\lambda = 1$) and in the case $H_0: \theta \ge \theta_0$ we reject only when $\bar{x} < \theta_0$. Hence for $H_0: \theta \le \theta_0$ we reject for large values of $\bar{x}$ or for large values of $u = (x_1+\cdots+x_n)$. When $x_j$ has exponential distribution, $u$ has a gamma distribution with parameters $(\alpha = n, \beta = \theta_0)$, under $H_0$, or $v = \frac{u}{\theta_0}$ is gamma with parameters $(n,1)$ or $w = \frac{2u}{\theta_0} \sim \chi^2_{2n}$, a chi-square with $2n$ degrees of freedom. Hence for testing the hypothesis, we can use either $u$ or $v$ or $w$. Hence the criteria are the following, given in terms of a chi-square variable.


Case (43). Exponential population with parameter $\theta$. $H_0: \theta \le \theta_0$ (given), $H_1: \theta > \theta_0$. Test statistic: $w = \frac{2(x_1+\cdots+x_n)}{\theta_0} \sim \chi^2_{2n}$. Criterion: reject $H_0$ if the observed value of $w \ge \chi^2_{2n,\alpha}$.

Case (44). Same population as in Case (43). $H_0: \theta \ge \theta_0$ (given), $H_1: \theta < \theta_0$. Test statistic is the same $w$ as in Case (43). Criterion: reject $H_0$ if the observed value of $w \le \chi^2_{2n,1-\alpha}$.

Case (45). Same population as in Case (43). $H_0: \theta = \theta_0$ (given), $H_1: \theta \ne \theta_0$. Test statistic is the same $w$ as above. Criterion: reject $H_0$ if the observed value of $w \le \chi^2_{2n,1-\frac{\alpha}{2}}$ or $w \ge \chi^2_{2n,\frac{\alpha}{2}}$. [Equal tail areas are taken only for convenience.] The illustrations for Cases (43), (44), (45) are given in Figure 13.9, with degrees of freedom $2n$.
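Because the degrees of freedom $2n$ are even, the upper tail area of $w$ has a closed form, $\Pr\{\chi^2_{2n} \ge w\} = e^{-w/2}\sum_{k=0}^{n-1}\frac{(w/2)^k}{k!}$, so Case (43) can be checked without a chi-square table by comparing this tail area with $\alpha$. The sketch below uses hypothetical waiting times invented for illustration, and the function name is ours.

```python
import math

def chi2_upper_tail_even(w, n):
    """Pr{chi-square with 2n degrees of freedom >= w}; closed form for even d.f."""
    half = w / 2.0
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= half / k        # builds (w/2)^k / k!
        total += term
    return math.exp(-half) * total

times = [3.0, 5.0, 4.0]         # hypothetical observations (not from the text)
theta0 = 2.0                    # hypothesized bound in H0: theta <= theta0
alpha = 0.05

w = 2.0 * sum(times) / theta0   # test statistic of Case (43)
tail = chi2_upper_tail_even(w, len(times))

# Rejecting when w >= chi^2_{2n,alpha} is the same as rejecting when tail <= alpha.
reject = tail <= alpha
print(w, round(tail, 4), reject)
```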


Exercises 13.7 


13.7.1. In a drug-testing experiment on mice, some mice die out before the experiment is completed. All mice are of identical genotype and age. Assuming that the probability of a mouse dying is $p$ and it is the same for all mice, test the hypotheses (1) $H_0: p \le 0.4$, $H_1: p > 0.4$; (2) $H_0: p = 0.4$, $H_1: p \ne 0.4$, at a 5% level of rejection, and based on the data in one experiment of 10 mice where (a) 5 died, (b) 4 died.


13.7.2. The number of floods in Meenachil River in the month of September is seen to be Poisson distributed with expected number $\theta$. Test the hypotheses, at 5% level of rejection, that (1) $H_0: \theta = 2$, $H_1: \theta \ne 2$, (2) $H_0: \theta \ge 2$, $H_1: \theta < 2$. Two years are selected at random. The numbers of floods in September are 8, 1.


13.7.3. A typist makes mistakes on every page she types. Let $p$ be the probability of finding at least 5 mistakes in any page she types. Test the hypotheses (1) $H_0: p = 0.8$, $H_1: p \ne 0.8$, (2) $H_0: p \ge 0.8$, $H_1: p < 0.8$, at a 5% level of rejection, and based on the following observations: Out of 8 randomly selected pages that she typed, 6 had mistakes of more than 5 each.


13.7.4. Lightning strikes in the Pala area in the month of May are seen to be Poisson distributed with expected number $\theta$. Test the hypotheses (1) $H_0: \theta \le 3$, $H_1: \theta > 3$; (2) $H_0: \theta = 3$, $H_1: \theta \ne 3$, at a 5% level of rejection, and based on the following observations. Three years are selected at random and there were 2, 2, 3 lightning strikes in May.




13.7.5. The waiting time in a queue at a check-out counter in a grocery store is assumed to be exponential with expected waiting time $\theta$ minutes, time being measured in minutes. The following are observations on 3 randomly selected occasions at this check-out counter: 10, 8, 2. Check the following hypotheses at a 5% level of rejection: (1) $H_0: \theta \le 2$, $H_1: \theta > 2$; (2) $H_0: \theta = 8$, $H_1: \theta \ne 8$.


Remark 13.4. After having examined various hypotheses and testing procedures, the students can now evaluate the whole procedure of testing hypotheses. All the test criteria or test statistics were constructed by prefixing the probability of rejection of $H_0$ under $H_0$, namely $\alpha$, and the probability of rejection of $H_0$ under the alternate, namely $1 - \beta$. Thus the whole thing is backing up only the situation of rejecting $H_0$. If $H_0$ is not rejected, then the theory has nothing to say. In a practical situation, the person calling for a statistical test may have the opposite in mind, to accept $H_0$, but the theory does not justify such a process of "accepting $H_0$".


There are two standard technical terms, which are used in this area of testing of 
statistical hypotheses. These are null and non-null distributions of test statistics. 


Definition 13.7 (Null and non-null distributions). The distribution of a test statistic $\lambda$ under the null hypothesis $H_0$ is called the null distribution of $\lambda$ and the distribution of $\lambda$ under negation of $H_0$ is called the non-null distribution of $\lambda$.


Observe that we need the null distribution of a test statistic for carrying out the test and the non-null distribution for studying the power of the test so that the test can be compared with other tests of the same size $\alpha$.


13.8 Some hypotheses on multivariate normal 


In the following discussion, capital Latin letters will stand for vector or matrix variables and small Latin letters for scalar variables, Greek letters (small and capital) will be used to denote parameters, and a prime will denote the transpose. We only consider real scalar, vector, matrix variables here. Let $X$ be $p \times 1$, $X' = (x_1,\ldots,x_p)$. The standard notation $N_p(\mu,\Sigma)$, $\Sigma > 0$ means that the $p \times 1$ vector $X$ has a p-variate non-singular normal distribution with mean value vector $\mu' = (\mu_1,\ldots,\mu_p)$ and non-singular positive definite $p \times p$ covariance matrix $\Sigma$, and with the density function

$$f(X) = \frac{1}{(2\pi)^{\frac{p}{2}}|\Sigma|^{\frac{1}{2}}}\exp\left\{-\frac{1}{2}(X-\mu)'\Sigma^{-1}(X-\mu)\right\} \tag{13.22}$$

where $-\infty < \mu_i < \infty$, $-\infty < x_i < \infty$, $i = 1,\ldots,p$, $\Sigma = (\sigma_{ij}) = \Sigma' > 0$. If we have a simple random sample of size $N$ from this $N_p(\mu,\Sigma)$, then the sample values and the sample matrix can be represented as follows:




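$$X_i = \begin{bmatrix} x_{1i}\\ \vdots\\ x_{pi}\end{bmatrix},\ i = 1,\ldots,N; \quad \text{sample matrix } \mathbf{X} = \begin{bmatrix} x_{11} & \cdots & x_{1N}\\ \vdots & & \vdots\\ x_{p1} & \cdots & x_{pN}\end{bmatrix}.$$

The sample average or sample mean vector, a matrix of sample means and the deviation matrix are the following:

$$\bar{X} = \begin{bmatrix}\bar{x}_1\\ \vdots\\ \bar{x}_p\end{bmatrix},\quad \bar{x}_i = \frac{1}{N}\sum_{k=1}^{N}x_{ik}; \quad \bar{\mathbf{X}} = \begin{bmatrix}\bar{x}_1 & \cdots & \bar{x}_1\\ \vdots & & \vdots\\ \bar{x}_p & \cdots & \bar{x}_p\end{bmatrix},$$

$$\mathbf{X} - \bar{\mathbf{X}} = \begin{bmatrix} x_{11}-\bar{x}_1 & \cdots & x_{1N}-\bar{x}_1\\ \vdots & & \vdots\\ x_{p1}-\bar{x}_p & \cdots & x_{pN}-\bar{x}_p\end{bmatrix}. \tag{13.23}$$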

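Then the sample sum of products matrix, denoted by $S$, is given by the following:

$$S = [\mathbf{X}-\bar{\mathbf{X}}][\mathbf{X}-\bar{\mathbf{X}}]' = (s_{ij})$$

where

$$s_{ij} = \sum_{k=1}^{N}(x_{ik}-\bar{x}_i)(x_{jk}-\bar{x}_j), \quad s_{ii} = \sum_{k=1}^{N}(x_{ik}-\bar{x}_i)^2. \tag{13.24}$$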


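Then the joint density of the sample values, for a simple random sample of size $N$ (iid variables), denoted by $L(\mathbf{X},\mu,\Sigma)$, is the following:

$$L(\mathbf{X},\mu,\Sigma) = \prod_{j=1}^{N}f(X_j) = \frac{1}{(2\pi)^{\frac{Np}{2}}|\Sigma|^{\frac{N}{2}}}\exp\left\{-\frac{1}{2}\sum_{j=1}^{N}(X_j-\mu)'\Sigma^{-1}(X_j-\mu)\right\}. \tag{13.25}$$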


The exponent can be simplified and written in the following form by using the fact 
that the trace of a 1 x 1 matrix or trace of a scalar quantity is itself. Also we have a 
general result on trace, namely, tr(AB) = tr(BA) as long as the products AB and BA are 
defined, AB need not be equal to BA. 




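$$\begin{aligned}
\sum_{j=1}^{N}(X_j-\mu)'\Sigma^{-1}(X_j-\mu) &= \sum_{j=1}^{N}\mathrm{tr}\left[(X_j-\mu)'\Sigma^{-1}(X_j-\mu)\right]\\
&= \sum_{j=1}^{N}\mathrm{tr}\left[\Sigma^{-1}(X_j-\mu)(X_j-\mu)'\right]\\
&= \mathrm{tr}\left\{\Sigma^{-1}\left[\sum_{j=1}^{N}(X_j-\mu)(X_j-\mu)'\right]\right\}\\
&= \mathrm{tr}\left[\Sigma^{-1}S\right] + N(\bar{X}-\mu)'\Sigma^{-1}(\bar{X}-\mu).
\end{aligned} \tag{13.26}$$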


Then the maximum likelihood estimators (MLE) of $\mu$ and $\Sigma$ are

$$\hat\mu = \bar X,\qquad \hat\Sigma = \frac{S}{N}. \tag{13.27}$$

Those who are familiar with vector and matrix derivatives may do the following. Take
$\ln L(\mathbf{X},\mu,\Sigma)$, operate with $\frac{\partial}{\partial\mu}$ and $\frac{\partial}{\partial\Sigma}$, equate to the null vector and null matrix respectively
and solve. Use the form as in (13.26), which will provide the solutions easily. For the
sake of illustration, we will derive one likelihood ratio criterion or $\lambda$-criterion.


13.8.1 Testing the hypothesis of independence 
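Let the hypothesis be that the individual components $x_1,\ldots,x_p$ are independently
distributed. In the normal case, this will imply that the covariance matrix $\Sigma$ is diagonal,
$\Sigma = \operatorname{diag}(\sigma_{11},\ldots,\sigma_{pp})$, where $\sigma_{ii} = \sigma_i^2 = \operatorname{Var}(x_i)$, $i = 1,\ldots,p$. In the whole parameter
space $\Omega$, the MLE are $\hat\mu = \bar X$ and $\hat\Sigma = \frac{S}{N}$ and the supremum of the likelihood function
is

$$\sup_{\Omega}L = \frac{e^{-Np/2}}{(2\pi)^{Np/2}\,|S/N|^{N/2}}.$$

Under the null hypothesis of independence of components, the likelihood function
splits into a product. Let $L_0$ denote $L$ under $H_0$. Then

$$L_0 = \prod_{j=1}^{p}\frac{\exp\big\{-\frac{1}{2\sigma_j^2}\sum_{k=1}^{N}(x_{jk}-\mu_j)^2\big\}}{(2\pi)^{N/2}[\sigma_j^2]^{N/2}}$$

which leads to the MLE as $\hat\mu_j = \bar x_j$, $\hat\sigma_j^2 = \frac{s_{jj}}{N}$. Hence

$$\sup_{H_0}L = \frac{e^{-Np/2}}{(2\pi)^{Np/2}\prod_{j=1}^{p}(s_{jj}/N)^{N/2}}.$$

Hence

$$\lambda = \frac{\sup_{H_0}L}{\sup_{\Omega}L} = \frac{|S/N|^{N/2}}{\prod_{j=1}^{p}(s_{jj}/N)^{N/2}}\quad\Rightarrow\quad u = c\lambda^{2/N} = \frac{|S|}{s_{11}s_{22}\cdots s_{pp}}$$

where $c$ is a constant. The null distribution of $u = c\lambda^{2/N}$ is the distribution of $u$ when
the population is $N_p(\mu,\Sigma) = \prod_{j=1}^{p}N(\mu_j,\sigma_j^2)$ and the non-null distribution of $u$ is the
distribution of $u$ when the population is $N_p(\mu,\Sigma)$, $\Sigma > 0$. These can be evaluated by using
the real matrix-variate gamma distribution of $S$. This material is beyond the scope of
this book.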
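
As a small numerical illustration, the following Python sketch computes the sample mean vector, the sum of products matrix $S$ of (13.24), the MLE $S/N$ of (13.27) and the ratio $|S|/(s_{11}s_{22})$ for $p = 2$, which is what the ratio of the supremum under $H_0$ to the supremum over $\Omega$ reduces to after raising it to the power $2/N$. The data values and the helper name `sum_of_products` are hypothetical and are not taken from the text; this is only a sketch under those assumptions.

```python
def sum_of_products(rows):
    """rows[i] holds the N observations on the i-th component (hypothetical helper).

    Returns the vector of sample means and the p x p sum of products matrix S.
    """
    n_obs = len(rows[0])
    p = len(rows)
    means = [sum(r) / n_obs for r in rows]
    s = [[sum((rows[i][k] - means[i]) * (rows[j][k] - means[j])
              for k in range(n_obs))
          for j in range(p)]
         for i in range(p)]
    return means, s


# Hypothetical sample of size N = 5 on a 2 x 1 vector (x1, x2)'
sample = [[1.0, 2.0, 3.0, 4.0, 5.0],
          [2.0, 1.0, 4.0, 3.0, 5.0]]

means, S = sum_of_products(sample)
N = len(sample[0])

sigma_hat = [[s_ij / N for s_ij in row] for row in S]   # MLE of Sigma, S / N
det_S = S[0][0] * S[1][1] - S[0][1] * S[1][0]            # |S| when p = 2
u = det_S / (S[0][0] * S[1][1])                          # |S| / (s11 * s22)

print("sample means:", means)
print("S:", S)
print("MLE of Sigma:", sigma_hat)
print("u:", u)
```

For $p = 2$ the ratio equals $1-r^2$, where $r$ is the sample correlation coefficient, so values of $u$ near 1 are consistent with independence and values near 0 are not.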


Exercises 13.8 


13.8.1. If the $p\times 1$ vector $X$ has a non-singular $p$-variate normal distribution, $N_p(\mu,\Sigma)$,
$\Sigma > 0$, write down the densities when (1) $\Sigma$ = a diagonal matrix, (2) $\Sigma = \sigma^2 I$ where $\sigma^2$
is a scalar quantity and $I$ is the identity matrix, (3) $\Sigma$ is a block diagonal matrix $\Sigma =
\operatorname{diag}(\Sigma_{11},\ldots,\Sigma_{qq})$, $q \le p$.


13.8.2. If a simple random sample of size $N$ is available from a $N_p(\mu,\Sigma)$, $\Sigma > 0$, then
derive the MLE of $\mu$ and $\Sigma$.


13.8.3. For the problem in Exercise 13.8.1, derive the MLE of the parameters under the
special cases (1), (2) and (3) there.


13.8.4. Fill in the steps in the derivation of the result in (13.26).


13.8.5. Construct the $\lambda$-criterion for testing the hypotheses that $\Sigma$ is of the form of (2)
and (3) of Exercise 13.8.1.


13.9 Some non-parametric tests 


In previous sections, we have been dealing with hypotheses on the parameters of a
pre-selected distribution. In other words, we have already selected a model (density
or probability function) and we are concerned about one or more parameters in this
selected model. Now, we will consider situations which are not basically of this nature.
Some parameters or models may enter into the picture at a secondary stage. Suppose
that we want to test whether there is any association between two qualitative
characteristics or two quantitative characteristics or one qualitative and one quantitative
characteristic, such as the habit of wearing a tall hat (qualitative) and longevity of life
(quantitative), intelligence (qualitative/quantitative) and weight (quantitative), color
of eyes (qualitative) and behavior (qualitative), preference of certain types of clothes
(qualitative) and beauty (qualitative), etc. Such a problem is not of the parametric type that
we have considered so far. Suppose that we have some data at hand, such as data on
heights of students in a class, and we would like to see whether height is normally
distributed. Such problems usually fall in the category of tests known as “lack-of-fit” or
“goodness-of-fit” tests. Here, we are concerned about the selection of a model for the
data at hand. In a production process where there is a serial production of a machine
part, some of the parts may be defective in the sense of not satisfying quality
specifications. We would like to see whether defective items are occurring at random. In
a cloth weaving process, sometimes threads get tangled and uniformity of the cloth is
lost, and we would like to see whether such an event is occurring at random or not.
This is the type of situation where we want to test for randomness of an event. We will
start with lack-of-fit tests.


13.9.1 Lack-of-fit or goodness-of-fit tests 


Two popular test statistics, which are used for testing goodness-of-fit or lack-of-fit
of the selected model to given data, are Pearson's $X^2$ statistic and the Kolmogorov-Smirnov
statistic. We will examine both of these procedures. The essential difference
is that for applying Pearson's $X^2$ statistic we need data in a categorized form. We do
not need the actual data points. We need only the numbers of
observations falling into various categories or classes. If the actual data points are
available and if we wish to use Pearson's $X^2$, then the data are to be categorized. This
brings in arbitrariness. If different people categorize the data, they may use different
class intervals and these may result in contradictory conclusions in the applications
of Pearson's $X^2$ statistic. The Kolmogorov-Smirnov statistic makes use of the actual
data points, and is not applicable to data which are already categorized in the form
of a frequency table. [In a discrete situation, actual data will be of the form of the actual
points the random variable can take, and the corresponding frequencies, which are not
considered to be categorized data.] Both of the procedures make use of some sort of
distance between the observed points and the points expected under the hypothesis
or under the assumed model. Pearson's statistic
is of the following form:


$$X^2 = \sum_{i=1}^{k}\frac{(o_i-e_i)^2}{e_i} \tag{13.28}$$

where $o_i$ = observed frequency in the $i$-th class, $e_i$ = the expected frequency in the $i$-th
class under the assumed model, for $i = 1,\ldots,k$. It can be shown that (13.28) is a generalized
distance between the vectors $(o_1,\ldots,o_k)$ and $(e_1,\ldots,e_k)$. It can also be shown
that $X^2$ is approximately distributed as a chi-square with $k-1-s$ degrees of freedom
when $s$ is the number of parameters estimated while computing the $e_i$'s. If no parameter
is estimated, then the degrees of freedom is $k-1$. This approximation is good only
when $k \ge 5$, $e_i \ge 5$ for each $i = 1,\ldots,k$. Hence (13.28) should be used under these conditions,
otherwise the chi-square approximation is not good. The Kolmogorov-Smirnov
statistics are the following:

$$D_n = \sup_x|S_n(x)-F(x)| \tag{13.29}$$

and

$$W^2 = E|S_n(x)-F(x)|^2 \tag{13.30}$$

where $S_n(x)$ is the sample distribution function (discussed in Section 11.5.2) and $F(x)$
is the population distribution function or the distribution function under the assumed
model or assumed distribution. In $W^2$, the expected value is taken under the
hypothesis or under $F(x)$. Both $D_n$ and $W^2$ are mathematical distances between
$S_n(x)$ and $F(x)$, and hence Kolmogorov-Smirnov tests are based on the actual distance
between the observed values represented by $S_n(x)$ and expected values represented
by $F(x)$. For example, suppose that we have data on waiting times in a queue. Our
hypothesis may be that the waiting time $t$ is exponentially distributed with expected
waiting time 10 minutes, time being measured in minutes. Then the hypothesis $H_0$
says that the underlying distribution is

$$H_0: f(t) = \frac{1}{10}e^{-t/10},\quad t \ge 0 \text{ and zero elsewhere.} \tag{13.31}$$

Then, under $H_0$, the population distribution function is

$$F(x) = \int_{-\infty}^{x}f(t)\,dt = \int_{0}^{x}\frac{e^{-t/10}}{10}\,dt = 1-e^{-x/10},\quad x \ge 0.$$

Suppose that the number of observations in the class $0 \le t < 2$ is 15, that is, it is observed that
the waiting time of 15 people in the queue is between 0 and 2 minutes. Then we may
say that the observed frequency in the first interval is $o_1 = 15$. What is the corresponding
expected value $e_1$? Suppose that the total number of observations is 100, that is, waiting
times of 100 people are taken. Then the expected value $e_1 = 100p_1$, where $p_1$ is the
probability of finding a person whose waiting time $t$ is such that $0 \le t < 2$. Under our
assumed model, we have

$$e_1 = 100\int_{0}^{2}\frac{e^{-t/10}}{10}\,dt = 100\big[1-e^{-2/10}\big] = 100\big[1-e^{-1/5}\big].$$
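
The expected frequency above can be evaluated numerically. The following is a minimal Python sketch, assuming the exponential model of (13.31) with mean 10 and a total of 100 observations as in the text. Only the first class $0 \le t < 2$ comes from the text; the remaining class boundaries in `edges`, and the helper names, are hypothetical choices made for illustration.

```python
import math


def exp_cdf(x, mean=10.0):
    """Distribution function 1 - exp(-x/mean) of the exponential model, zero for x <= 0."""
    if x <= 0:
        return 0.0
    return 1.0 - math.exp(-x / mean)


def expected_frequencies(edges, n_total, mean=10.0):
    """Expected class frequencies n * Pr{a <= t < b} for consecutive class boundaries."""
    return [n_total * (exp_cdf(b, mean) - exp_cdf(a, mean))
            for a, b in zip(edges, edges[1:])]


# Only the first class [0, 2) is from the text; the other boundaries are hypothetical.
edges = [0.0, 2.0, 5.0, 10.0, 20.0, math.inf]
expected = expected_frequencies(edges, n_total=100)

print("e_1 =", round(expected[0], 2))          # 100 * (1 - exp(-1/5))
print("all classes:", [round(e, 2) for e in expected])
```

Because the classes cover the whole range $t \ge 0$, the expected frequencies add up to the total frequency 100.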


In general, we have the following situation: 


Classes                 1              2              ...   $k$
Observed frequencies    $n_1$          $n_2$          ...   $n_k$
Expected frequencies    $e_1 = np_1$   $e_2 = np_2$   ...   $e_k = np_k$


where $n = n_1+\cdots+n_k$ = total frequency and $p_i$ is the probability of finding an observation
in the $i$-th class, under the hypothesis, $i = 1,\ldots,k$, $p_1+\cdots+p_k = 1$. Then

$$X^2 = \sum_{i=1}^{k}\frac{(n_i-np_i)^2}{np_i} \approx \chi^2_{k-1}. \tag{13.32}$$

For $k \ge 5$, $np_i \ge 5$, $i = 1,\ldots,k$ we have a good approximation in (13.32).


Figure 13.14: Critical region in Pearson's $X^2$ test.


When using a distance measure for testing the goodness-of-fit of a model, we will always
reject if the distance is large, that is, we reject when the observed $X^2$ is large. Hence,
when using the $X^2$ statistic the criterion is to reject $H_0$ if the observed $X^2 \ge \chi^2_{k-1,\alpha}$ for a test
at the level of rejection $\alpha$ as shown in Figure 13.14, where

$$Pr\{\chi^2_{k-1} \ge \chi^2_{k-1,\alpha}\} = \alpha.$$


Example 13.10. A tourist resort is visited by tourists from many countries. The resort 
operator has the following data in the month of January: 


Country USA UK Canada Italy Germany France Asia 
Frequency 22 12 18 10 20 18 30 


The resort operator has a hypothesis that the proportions of tourists visiting in the 
month of January of any year is 2:1:2:1:2:2:3. Test this hypothesis, at a 5% level 
of rejection. 


Solution 13.10. The total frequency $n = 22+12+18+10+20+18+30 = 130$. Under
the hypothesis, the expected frequencies are the following: For the USA, it is 2 out of
$13 = 2+1+2+1+2+2+3$ of 130 or $130\times\frac{2}{13} = 20$. Calculating the expected frequencies
like this, we have the following table:


Observed frequency 22 12 18 10 20 18 30 
Expected frequency 20 10 20 10 20 20 30 


The degrees of freedom for Pearson's $X^2$ is $k-1 = 7-1 = 6$. From a chi-square table,
$\chi^2_{6,0.05} = 12.59$. Also our conditions $k \ge 5$, $e_i \ge 5$, $i = 1,\ldots,7$ are satisfied, and hence a
good chi-square approximation can be expected. The observed value of $X^2$ is given by

$$\begin{aligned}
X^2 &= \frac{(22-20)^2}{20}+\frac{(12-10)^2}{10}+\frac{(18-20)^2}{20}+\frac{(10-10)^2}{10}+\frac{(20-20)^2}{20}+\frac{(18-20)^2}{20}+\frac{(30-30)^2}{30}\\
&= \frac{4}{20}+\frac{4}{10}+\frac{4}{20}+0+0+\frac{4}{20}+0 = 1.0.
\end{aligned}$$


The observed value of $X^2$ is not greater than the tabulated value 12.59, and hence we
cannot reject $H_0$. Does it mean that our model is a good fit or our hypothesis can be
“accepted”? Remember that within that distance of less than 12.59 units there could
be several distributions, and hence “accepting the claim” is not a proper procedure.
For example, try the proportions 3:1:1:1:2:2:3 and see that the hypothesis is not
rejected. At the most what we can say is only that the data seem to be consistent with
the resort operator's hypothesized proportions.
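
The computation in Example 13.10 can be sketched in Python as follows. The observed frequencies, the two sets of proportions and the tabulated value 12.59 are taken from the text; the function name `pearson_x2` is a hypothetical helper introduced only for this illustration.

```python
def pearson_x2(observed, ratios):
    """Expected frequencies under the given proportions and Pearson's X^2 statistic."""
    n = sum(observed)
    total = sum(ratios)
    expected = [n * r / total for r in ratios]
    x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return expected, x2


observed = [22, 12, 18, 10, 20, 18, 30]   # USA, UK, Canada, Italy, Germany, France, Asia
critical = 12.59                          # tabulated chi-square value, 6 d.f., 5% level

# Proportions claimed by the resort operator
expected_claim, x2_claim = pearson_x2(observed, [2, 1, 2, 1, 2, 2, 3])
# Alternative proportions suggested in the text
expected_alt, x2_alt = pearson_x2(observed, [3, 1, 1, 1, 2, 2, 3])

for label, x2 in (("2:1:2:1:2:2:3", x2_claim), ("3:1:1:1:2:2:3", x2_alt)):
    print(label, "X^2 =", round(x2, 3), "reject H0:", x2 >= critical)
```

Neither set of proportions is rejected at the 5% level, which illustrates why non-rejection cannot be read as confirmation of any one of them.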


Note 13.9. In statistical terminology, our hypothesis in Example 13.10 was on the
multinomial probabilities, saying that the multinomial probabilities are $p_1 = \frac{2}{13}$,
$p_2 = \frac{1}{13},\ldots,p_7 = \frac{3}{13}$ in a 6-variate multinomial probability law.


Example 13.11. Test the goodness-of-fit of a normal model, $x\sim N(\mu = 80,\sigma^2 = 100)$,
$x$ = grade obtained by students in a particular course, to the following data, at 5%
level of rejection.


Classes      $x < 50$    $50 \le x < 60$    $60 \le x < 70$    $70 \le x < 80$    $80 \le x < 90$    $90 \le x \le 100$
Frequency    225         220                235                240                230                220


Solution 13.11. Total frequency $n = 1370$. Let

$$f(x) = \frac{1}{10\sqrt{2\pi}}e^{-\frac{1}{2\times 100}(x-80)^2},\qquad y = \frac{x-80}{10},\qquad g(y) = \frac{1}{\sqrt{2\pi}}e^{-y^2/2}.$$

The probability $p_1$ of finding an observation in the interval $-\infty < x < 50$ is given by

$$p_1 = \int_{-\infty}^{50}f(x)\,dx = \int_{-\infty}^{-3}g(y)\,dy = 0.0014$$

from $N(0,1)$ tables, where $y = \frac{50-80}{10} = -3$. We have the following results from the
computations:

$$p_1 = 0.0014,\qquad e_1 = np_1 = 1370\times 0.0014 = 1.92$$

$$p_2 = \int_{50}^{60}f(x)\,dx = \int_{-3}^{-2}g(y)\,dy = 0.4986-0.4773 = 0.0213,\qquad e_2 = np_2 = 1370\times 0.0213 = 29.18$$

$$p_3 = \int_{60}^{70}f(x)\,dx = \int_{-2}^{-1}g(y)\,dy = 0.4773-0.3414 = 0.1359,\qquad e_3 = np_3 = 1370\times 0.1359 = 186.18$$

$$p_4 = \int_{70}^{80}f(x)\,dx = \int_{-1}^{0}g(y)\,dy = 0.3414-0 = 0.3414,\qquad e_4 = np_4 = 1370\times 0.3414 = 467.72$$

$$p_5 = \int_{80}^{90}f(x)\,dx = \int_{0}^{1}g(y)\,dy = 0.3414,\qquad e_5 = np_5 = 1370\times 0.3414 = 467.72$$

$$p_6 = \int_{90}^{100}f(x)\,dx = \int_{1}^{2}g(y)\,dy = 0.1359,\qquad e_6 = np_6 = 1370\times 0.1359 = 186.18.$$

Since $e_1 < 5$, we may combine the first and second classes to make the expected
frequency greater than 5 to have a good approximation. Then add up the observed
frequencies in the first two classes, $o_1+o_2 = 225+220 = 445$, call it $o_1'$, and the expected
frequencies, $e_1+e_2 = 1.92+29.18 = 31.10 = e_1'$. Thus, the effective number of classes is now
$6-1 = 5$; still the condition $k \ge 5$ is satisfied. The degrees of freedom is reduced by one
and the new degrees of freedom is $k'-1 = 5-1 = 4$, $k' = k-1$. The tabulated value of
$\chi^2_{k'-1,\alpha} = \chi^2_{4,0.05} = 9.49$. The observed value of $X^2$ is given by

$$X^2 = \frac{(445-31.10)^2}{31.10}+\frac{(235-186.18)^2}{186.18}+\frac{(240-467.72)^2}{467.72}+\frac{(230-467.72)^2}{467.72}+\frac{(220-186.18)^2}{186.18}.$$

We have to see only whether the observed $X^2$ is greater than 9.49 or not. For doing
this, we do not have to compute every term in $X^2$. Start with the term involving the
largest deviation first, then the second largest deviation, etc. The term with the largest
deviation is the first term. Hence let us compute this first:

$$\frac{(445-31.10)^2}{31.10} > 9.49$$

and hence the hypothesis is rejected. We do not have to compute any other term.
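
The class probabilities of Example 13.11 can also be obtained without tables. The following Python sketch, assuming the hypothesized $N(\mu = 80, \sigma^2 = 100)$ model and the class limits and frequencies of the example, evaluates the normal distribution function through `math.erf`, pools the first class into its neighbour while its expected frequency is below 5, and then evaluates Pearson's statistic. The variable and function names are hypothetical, and because exact probabilities are used instead of four-decimal table values the expected frequencies differ slightly from hand computations.

```python
import math


def normal_cdf(x, mu, sigma):
    """Distribution function of N(mu, sigma^2) written in terms of the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


mu, sigma = 80.0, 10.0
edges = [-math.inf, 50, 60, 70, 80, 90, 100]      # class limits of the example
observed = [225, 220, 235, 240, 230, 220]
n = sum(observed)

probs = [normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)
         for a, b in zip(edges, edges[1:])]
expected = [n * p for p in probs]

# Pool the first class into the second while its expected frequency is below 5
obs, exp_ = list(observed), list(expected)
while len(exp_) > 1 and exp_[0] < 5:
    obs[1] += obs[0]
    exp_[1] += exp_[0]
    obs.pop(0)
    exp_.pop(0)

terms = [(o - e) ** 2 / e for o, e in zip(obs, exp_)]
x2 = sum(terms)
dof = len(obs) - 1                                 # no parameters were estimated
critical = 9.49                                    # tabulated chi-square value, 4 d.f., 5% level

print("expected:", [round(e, 2) for e in expected])
print("pooled observed:", obs, "d.f.:", dof)
print("first term:", round(terms[0], 1), "reject H0:", x2 >= critical)
```

The first term alone is far larger than the critical value, in line with the remark that the remaining terms need not be computed.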


Note 13.10. When this particular normal distribution with $\mu = 80$ and $\sigma^2 = 100$ is rejected,
that does not mean that other normal distributions are also rejected. We can
try other normal populations with specified $\mu$ and $\sigma^2$. We can also try the goodness-of-fit
of any normal population. In this case, we do not know $\mu$ and $\sigma^2$. Then we
estimate $\mu$ by $\hat\mu$ and $\sigma^2$ by $\hat\sigma^2$ and try to fit $N(\hat\mu,\hat\sigma^2)$. MLE or moment estimates can be
used. In this case, two degrees of freedom for the chi-square are lost. In our example,
the effective degrees of freedom will then be $k'-1-2 = 5-1-2 = 2$, $k' = k-1$ due
to combining two classes, and then our approximation is not good either.




Note 13.11. The testing procedure based on Pearson's $X^2$ statistic and other so-called
“goodness-of-fit” tests are only good for rejecting the hypothesis, which
defeats the purpose of the test. Usually, one goes for such tests to claim that the
selected model is a good model. The testing procedure defeats this purpose. If the
test did not reject the hypothesis, then the maximum claim one can make is only
that the data seem to be consistent with the hypothesis or simply claim that the
discrepancy, measured by $X^2$, is small, and hence we may take the model as a
good model, remembering that several other models would have given the same or
smaller values of $X^2$.


Note 13.12. In our problem, the variable $x$ is a Gaussian random variable, which
ranges from $-\infty$ to $\infty$. But our data are only for $x \le 100$. Logically, we should have
taken the last cell or last interval as $90 \le x < \infty$ or $x \ge 90$, instead of taking
$90 \le x \le 100$.


13.9.2 Test for no association in a contingency table 


A two-way contingency table is a table of frequencies where individuals or items are
classified according to two qualitative or two quantitative or one qualitative and one
quantitative characteristics. For example, a random sample of 1,120 people may be
classified according to their ability to learn statistics and their weights, which gives
a two-way contingency table of the following type.


Example 13.12. Test for no association between the characteristics of classification in
the following two-way contingency table, where 1,120 people are classified according
to their ability to learn statistics and weights. Test at a 5% level of rejection, where
I = excellent, II = very good, III = good, IV = poor, $W_1$ = <50 kg, $W_2$ = 50-60 kg, $W_3$ =
>60 kg.


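Ability \ Weight   $W_1$                       $W_2$                        $W_3$                        Row sum
I                  50 = $n_{11}$ ($p_{11}$)    100 = $n_{12}$ ($p_{12}$)    120 = $n_{13}$ ($p_{13}$)    $n_{1.}$ = 270 ($p_{1.}$)
II                 100 = $n_{21}$ ($p_{21}$)   120 = $n_{22}$ ($p_{22}$)    80 = $n_{23}$ ($p_{23}$)     $n_{2.}$ = 300 ($p_{2.}$)
III                80 = $n_{31}$ ($p_{31}$)    90 = $n_{32}$ ($p_{32}$)     100 = $n_{33}$ ($p_{33}$)    $n_{3.}$ = 270 ($p_{3.}$)
IV                 90 = $n_{41}$ ($p_{41}$)    100 = $n_{42}$ ($p_{42}$)    90 = $n_{43}$ ($p_{43}$)     $n_{4.}$ = 280 ($p_{4.}$)
Sum                $n_{.1}$ = 320 ($p_{.1}$)   $n_{.2}$ = 410 ($p_{.2}$)    $n_{.3}$ = 390 ($p_{.3}$)    $n_{..}$ = 1120 ($p_{..}$ = 1)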


Here, the number of people or frequency in the $i$-th ability group and $j$-th weight group,
or in the $(i,j)$-th cell, is denoted by $n_{ij}$. The probability of finding an observation in the
$(i,j)$-th cell is denoted by $p_{ij}$. These are given in the brackets in each cell. The following
standard notations are used. These notations will be used in model building also.
A summation with respect to a subscript is denoted by a dot. Suppose that $i$ goes from
1 to $m$ and $j$ goes from 1 to $n$. Then

$$n_{i.} = \sum_{j=1}^{n}n_{ij};\qquad n_{.j} = \sum_{i=1}^{m}n_{ij};\qquad \sum_{i=1}^{m}\sum_{j=1}^{n}n_{ij} = n_{..} = \text{grand total}. \tag{13.33}$$

Similar notations are used on $p_{ij}$'s so that the total probability $p_{..} = 1$. In the above
table of observations, the row sums of frequencies are denoted by $n_{1.}, n_{2.}, n_{3.}, n_{4.}$ and
the column sums of frequencies are denoted by $n_{.1}, n_{.2}, n_{.3}$.


Solution 13.12. If the two characteristics of classification are independent, in the
sense that there is no association between the two characteristics of classification,
then the probability in the $(i,j)$-th cell is the product of the probabilities of finding an
observation in the $i$-th row and in the $j$-th column or $p_{ij} = p_{i.}p_{.j}$. Note that in a multinomial
probability law the probabilities are estimated by the corresponding relative
frequencies. For example, $p_{i.}$ and $p_{.j}$ are estimated by

$$\hat p_{i.} = \frac{n_{i.}}{n_{..}},\qquad \hat p_{.j} = \frac{n_{.j}}{n_{..}} \tag{13.34}$$

and, under the hypothesis of independence of the characteristics of classification, the
expected frequency in the $(i,j)$-th cell is estimated by

$$e_{ij} = n_{..}\hat p_{i.}\hat p_{.j} = n_{..}\times\frac{n_{i.}}{n_{..}}\times\frac{n_{.j}}{n_{..}} = \frac{n_{i.}n_{.j}}{n_{..}}. \tag{13.35}$$

Hence, multiply the marginal totals and then divide by the grand total to obtain the
expected frequency in each cell, under the hypothesis of independence of the characteristics
of classification. For example, in the first row, first column or $(1,1)$-th cell the
expected frequency is $\frac{270\times 320}{1120} = 77.14$. The following table gives the observed frequencies
and the expected frequencies. The expected frequencies are given in the brackets.

50 (77.14)     100 (98.84)     120 (94.02)
100 (85.71)    120 (109.82)    80 (104.46)
80 (77.14)     90 (98.84)      100 (94.02)
90 (80)        100 (102.5)     90 (97.5)




Then an observed value of Pearson's $X^2$ statistic is the following:

$$\begin{aligned}
X^2 &= \frac{(50-77.14)^2}{77.14}+\frac{(100-98.84)^2}{98.84}+\frac{(120-94.02)^2}{94.02}\\
&\quad+\frac{(100-85.71)^2}{85.71}+\frac{(120-109.82)^2}{109.82}+\frac{(80-104.46)^2}{104.46}\\
&\quad+\frac{(80-77.14)^2}{77.14}+\frac{(90-98.84)^2}{98.84}+\frac{(100-94.02)^2}{94.02}\\
&\quad+\frac{(90-80)^2}{80}+\frac{(100-102.5)^2}{102.5}+\frac{(90-97.5)^2}{97.5}.
\end{aligned}$$


In general, what is the degrees of freedom of Pearson's $X^2$ statistic in testing the
hypothesis of independence in a two-way contingency table? If there are $m$ rows and $n$
columns, then the total number of cells is $mn$ but we have estimated $p_{i.}$, $i = 1,\ldots,m$
which gives $m-1$ parameters estimated. Similarly, $p_{.j}$, $j = 1,\ldots,n$ gives $n-1$ parameters
estimated because in each case the total probability $p_{..} = 1$. Thus the degrees
of freedom is $mn-(m-1)-(n-1)-1 = (m-1)(n-1)$. In our case above, $m = 4$ and
$n = 3$, and hence the degrees of freedom is $(m-1)(n-1) = (3)(2) = 6$. The chi-square
approximation is good when the expected frequency in each cell $e_{ij} \ge 5$ for all $i$ and $j$
and $mn \ge 5$. In our example above, the conditions are satisfied and we can expect a
good chi-square approximation for Pearson's $X^2$ statistic or

$$X^2 \approx \chi^2_{(m-1)(n-1)},\quad\text{for } mn \ge 5,\ e_{ij} \ge 5 \text{ for all } i \text{ and } j. \tag{13.36}$$


In our example, if we wish to test at a 5% level of rejection, then the tabulated value
of $\chi^2_{6,0.05} = 12.59$. It is not necessary to compute each and every term in $X^2$. Compute
the terms with the largest deviations first. The $(1,1)$-th term gives

$$\frac{(50-77.14)^2}{77.14} = 9.55.$$

The $(1,3)$-th term gives $\frac{(120-94.02)^2}{94.02} = 7.18$. Hence the sum of these two terms alone
exceeds the critical point 12.59 and hence we reject the hypothesis of independence of
the characteristics of classification here, at a 5% level of rejection.
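
The marginal totals, the expected frequencies $e_{ij} = n_{i.}n_{.j}/n_{..}$ and Pearson's statistic for the table of Example 13.12 can be checked with the following minimal Python sketch. The cell frequencies and the tabulated value 12.59 are from the example; the variable names are hypothetical.

```python
# Observed frequencies: rows are ability groups I-IV, columns are weight groups W1-W3
table = [[50, 100, 120],
         [100, 120, 80],
         [80, 90, 100],
         [90, 100, 90]]

row_sums = [sum(row) for row in table]            # n_i.
col_sums = [sum(col) for col in zip(*table)]      # n_.j
grand = sum(row_sums)                             # n_..

# Expected frequencies under independence: (row total)(column total)/(grand total)
expected = [[r * c / grand for c in col_sums] for r in row_sums]

terms = [[(table[i][j] - expected[i][j]) ** 2 / expected[i][j]
          for j in range(len(col_sums))]
         for i in range(len(row_sums))]
x2 = sum(sum(row) for row in terms)
dof = (len(row_sums) - 1) * (len(col_sums) - 1)
critical = 12.59                                  # tabulated chi-square value, 6 d.f., 5% level

print("row sums:", row_sums, "column sums:", col_sums, "grand total:", grand)
print("expected:", [[round(e, 2) for e in row] for row in expected])
print("X^2 =", round(x2, 2), "d.f. =", dof, "reject H0:", x2 >= critical)
```

Summing all twelve terms gives a value well above 12.59, so the shortcut of stopping after the two largest terms leads to the same decision.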


Note 13.13. Rejection of our hypothesis of independence of the characteristics of
classification does not mean that there is association between the characteristics.
In the beginning stages of the development of statistics as a discipline, people
were making all sorts of contingency tables and claiming that there was association
between characteristics of classification such as the habit of wearing tall hats
and longevity of life, etc. Misuses went to the extent that people were claiming that
“anything and everything could be proved by statistical techniques”. Remember
that nothing is established or proved by statistical techniques and, as remarked
earlier, non-rejection of $H_0$ cannot be given any meaningful interpretation
because the statistical procedures do not support or justify any claim if $H_0$
is not rejected.




13.9.3 Kolmogorov-Smirnov statistic $D_n$


$D_n$ is already stated in (13.29), which is,

$$D_n = \sup_x|S_n(x)-F(x)|$$

where $S_n(x)$ is the sample distribution function and $F(x)$ is the population distribution
function, under the population assumed by the hypothesis. Let the hypothesis $H_0$ be that
the underlying distribution is continuous, has density $f(x)$ and distribution function
$F(x)$, for example, $f(x)$ is an exponential density. Then $F(x)$ will produce a continuous
curve and $S_n(x)$ will produce a step function as shown in Figure 13.15.


Figure 13.15: Sample and population distribution functions $S_n(x)$ and $F(x)$.


We will illustrate the computations with a specific example. 


Example 13.13. Check to see whether the following data could be considered to have
come from a normal population $N(\mu = 13,\sigma^2 = 1)$. Data: 16, 15, 15, 9, 10, 10, 12, 12, 11, 13,
13, 13, 14, 14.


Solution 13.13. Here, $H_0$ is that the population density is

$$f(x) = \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(x-13)^2}$$

and, therefore,

$$F(x) = \int_{-\infty}^{x}f(t)\,dt.$$

At $x = 9$,

$$F(9) = \int_{-\infty}^{9}f(t)\,dt = \int_{-\infty}^{-4}\frac{1}{\sqrt{2\pi}}e^{-y^2/2}\,dy = 0.0000$$

from $N(0,1)$ tables. For $x = 10$, $F(10) = 0.0013$ from $N(0,1)$ tables. But at $x = 9$ the sample
distribution function is $S_n(x) = S_{14}(9) = \frac{1}{14} = 0.0714$ and this remains the same for
$9 \le x < 10$. Hence, theoretically we have $|S_n(x)-F(x)| = 0.0714-0.0000 = 0.0714$ at
$x = 9$. But $\lim_{x\to 10^-}S_n(x) = \frac{1}{14}$, and hence we should take the difference at the point
$x = 10$ also, which is $|S_n(9)-F(10)| = |0.0714-0.0013| = 0.0701$. Hence, for each interval
we should record these two values and take the largest of all such values to obtain
an observed value of $D_n$. The computed values are given in the following table:


$x$    Frequency    $F(x)$     $S_{14}(x)$
9      1            0.0000     1/14 = 0.0714,   $9 \le x < 10$
10     2            0.0013     3/14 = 0.2142,   $10 \le x < 11$
11     1            0.0228     4/14 = 0.2856,   $11 \le x < 12$
12     2            0.1587     6/14 = 0.4284,   $12 \le x < 13$
13     3            0.5000     9/14 = 0.6426,   $13 \le x < 14$
14     2            0.8413     11/14 = 0.7854,  $14 \le x < 15$
15     2            0.9772     13/14 = 0.9282,  $15 \le x < 16$
16     1            0.9987     14/14 = 1.0000,  $16 \le x < \infty$


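In the following table, we have two points for each interval: the difference at the left
end of the interval starting at $x$, and the difference between the same value of $S_{14}$
and $F$ at the right end of that interval. These are given against the $x$-values:


$x$     $|S_{14}(x)-F(x)|$
9       0.0714, 0.0701
10      0.2129, 0.1914
11      0.2628, 0.1269
12      0.2697, 0.0716
13      0.1426, 0.1987
14      0.0559, 0.1918
15      0.0490, 0.0705
16      0.0013, 0.0000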


The largest of the entries in the last two columns is 0.2697, and hence the observed
value of $D_n$ = 0.2697. Tables of $D_n$ are available. The tabled value of $D_n$ = 0.35. We
reject the hypothesis only when the distance is large or when the observed $D_n$ is bigger
than the tabulated $D_n$. In the above example, the hypothesis is not rejected since the
observed value 0.2697 is not greater than the tabulated value 0.35.
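
The two-values-per-interval bookkeeping of Solution 13.13 can be sketched in Python as follows, assuming the hypothesized $N(13, 1)$ model and the 14 data points of Example 13.13. At each distinct data point the distribution function is compared with the sample distribution function at that point and with its value just to the left of that point. The names are hypothetical, and exact values of the normal distribution function are used, so the result differs from the hand computation in the fourth decimal place.

```python
import math
from collections import Counter


def normal_cdf(x, mu=13.0, sigma=1.0):
    """Distribution function of N(mu, sigma^2) written in terms of the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


data = [16, 15, 15, 9, 10, 10, 12, 12, 11, 13, 13, 13, 14, 14]
n = len(data)
counts = Counter(data)

d_n = 0.0
cumulative = 0
for x in sorted(counts):
    f_x = normal_cdf(x)
    left_limit = cumulative / n        # value of S_n just before the jump at x
    cumulative += counts[x]
    at_x = cumulative / n              # value of S_n at x
    d_n = max(d_n, abs(at_x - f_x), abs(left_limit - f_x))

tabled = 0.35                          # tabulated value quoted in the text
print("observed D_n =", round(d_n, 4), "reject H0:", d_n > tabled)
```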


Note 13.14. When considering goodness-of-fit tests, we have assumed the underlying
populations to be continuous. How do we test the goodness-of-fit of a
discrete distribution to the given data? Suppose that the data are the following:
3, 3, 5, 5, 5, 5, 6, 6 or $x_1 = 3$ with frequency $n_1 = 2$, $x_2 = 5$ with frequency $n_2 = 4$, $x_3 = 6$
with frequency $n_3 = 2$. Whatever is seen here is the best fitting discrete distribution,
namely,

$$f(x) = \begin{cases}2/8, & x = 3\\ 4/8, & x = 5\\ 2/8, & x = 6\\ 0, & \text{elsewhere.}\end{cases}$$

There is no better fitting discrete distribution to these data. Hence, testing goodness-of-fit
of a discrete distribution to the data at hand does not have much meaning.




There are a few other non-parametric tests called sign test, rank test, run test, etc. 
We will give a brief introduction to these. For more details, the students must consult 
books on non-parametric statistics. 


13.9.4 The sign test 


The sign test is applicable when the population is continuous and when we know that
the underlying population is symmetric about a parameter $\theta$, that is, the population
density $f(x)$ is symmetric about $x = \theta$, such as a normal population, which is symmetric
about $x = \mu$ where $\mu = E(x)$. We wish to test a hypothesis on $\theta$. Let $H_0: \theta = \theta_0$ (given).
Then under $H_0$ the underlying population is symmetric about $x = \theta_0$. In this case, the
probability of getting an observation from this population less than $\theta_0$ is $\frac{1}{2}$ = the
probability of getting an observation greater than $\theta_0$, and due to continuity, the probability
of getting an observation equal to $\theta_0$ is zero. Hence finding an observation above $\theta_0$
can be taken as a Bernoulli success and finding an observation below $\theta_0$ as a failure,
or vice versa. Therefore, the procedure is the following: Delete all observations equal
to $\theta_0$. Let the resulting number of observations be $n$. Put a plus sign for an observation
above $\theta_0$ and put a minus sign for observations below $\theta_0$. Count the number of + signs.
This number of + signs can be taken as the number of successes in $n$ Bernoulli trials.
Hence we can translate the hypothesis $H_0: \theta = \theta_0$, $H_1: \theta \ne \theta_0$ into $H_0: p = \frac{1}{2}$, $H_1: p \ne \frac{1}{2}$
where $p$ is the probability of success in a Bernoulli trial. Then the test criterion for a test
at the level of rejection $\alpha$ can be stated as follows: Reject $H_0$ if the observed number
of plus signs is small or large. For convenience, we may cut off equal tail probabilities
$\frac{\alpha}{2}$ at both ends. Let $y$ be the number of plus signs; then compute $y_0$, as large as
possible, and $y_1$, as small as possible, such that

$$\sum_{y=0}^{y_0}\binom{n}{y}(0.5)^y(1-0.5)^{n-y} \le \frac{\alpha}{2} \tag{a}$$

and

$$\sum_{y=y_1}^{n}\binom{n}{y}(0.5)^y(1-0.5)^{n-y} \le \frac{\alpha}{2}. \tag{b}$$

If the observed number of plus signs is less than or equal to $y_0$ or greater than or equal
to $y_1$, then reject $H_0$. Since the test is based on signs, it is called a sign test.


Example 13.14. The following is the data on the yield of wheat from 12 test plots:
5, 1, 8, 9, 11, 4, 7, 12, 5, 6, 8, 9. Assume that the population is symmetric about $\mu = E(x)$
where $x$ is the yield in a test plot. Test the hypothesis that $\mu = 9$ at the level of rejection
of 5%.




Solution 13.14. In our notation, $\theta_0 = 9$. There are two observations equal to 9, and
hence delete these and there are 10 remaining observations. Mark the observations
bigger than 9 by a plus sign: 11(+), 12(+). There are two observations bigger than 9 and
then the observed number of successes in 10 Bernoulli trials is 2. Here, $\alpha = 0.05$ or
$\frac{\alpha}{2} = 0.025$. From a binomial table for $p = \frac{1}{2}$, we have $y_0 = 1$ and $y_1 = 9$, which are the
closest points where the probability inequalities in (a) and (b) above are satisfied. Our
observed value is 2 which is not in the critical region, and hence the hypothesis is not
rejected at the 5% level of rejection.
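
The cut-off points obtained from a binomial table can also be computed directly. The following Python sketch, assuming equal tail probabilities $\alpha/2$ as in inequalities (a) and (b), finds the largest $y_0$ and the smallest $y_1$ satisfying them and applies the sign test to the wheat-yield data of Example 13.14. The function names are hypothetical helpers introduced for illustration.

```python
from math import comb


def sign_test_limits(n, alpha):
    """Largest y0 and smallest y1 with binomial(n, 1/2) tail probabilities <= alpha/2."""
    probs = [comb(n, y) * 0.5 ** n for y in range(n + 1)]
    y0, lower = -1, 0.0                 # y0 = -1 means the lower tail is empty
    for y in range(n + 1):
        if lower + probs[y] > alpha / 2:
            break
        lower += probs[y]
        y0 = y
    y1, upper = n + 1, 0.0              # y1 = n + 1 means the upper tail is empty
    for y in range(n, -1, -1):
        if upper + probs[y] > alpha / 2:
            break
        upper += probs[y]
        y1 = y
    return y0, y1


def sign_test(data, theta0, alpha=0.05):
    """Number of plus signs, effective sample size, cut-off points and the decision."""
    kept = [x for x in data if x != theta0]          # delete observations equal to theta0
    plus = sum(1 for x in kept if x > theta0)
    y0, y1 = sign_test_limits(len(kept), alpha)
    return plus, len(kept), y0, y1, (plus <= y0 or plus >= y1)


yields = [5, 1, 8, 9, 11, 4, 7, 12, 5, 6, 8, 9]
print(sign_test(yields, theta0=9))   # hypothesis of Example 13.14
print(sign_test(yields, theta0=8))   # another value of theta0, also not rejected
```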


Note 13.15. Does it mean that there is a line of symmetry at $x = 9$ for the distribution?
For the same data if $H_0$ is $\theta_0 = 8$, then again $n = 10$ and the observed number of successes
is 4 and $H_0$ is not rejected. We can have many such values for $\theta_0$ and still the
hypotheses will not be rejected. That does not mean that the underlying distribution
has symmetry at all these points. Besides, $p = \frac{1}{2}$ is not uniquely determining a line of
symmetry. Hence trying to give an interpretation for non-rejection is meaningless. Due
to this obvious fallacy some people modify the hypothesis saying that at $x = \theta_0$ there
is the median of the underlying population. Again, $p = \frac{1}{2}$ does not uniquely determine
$x = \theta_0$ as the median point. There could be several points qualifying to be the median.
Hence non-rejection of $H_0$ cannot be given a justifiable interpretation.


13.9.5 The rank test


This is mainly used for testing the hypothesis that two independent populations are
identical, against the alternative that they are not identical. If $x$ is the typical yield of
ginger from a test plot under organic fertilizer and $y$ is the yield under a chemical fertilizer,
then we may want to claim that $x$ and $y$ have identical distributions, whatever be
the distributions. Our observations may be of the following forms: $n_1$ observations on
$x$ and $n_2$ observations on $y$ are available. Then for applying a rank test the procedure
is the following: Pool the observations on $x$ and $y$ and order them according to their
magnitudes. Give to the smallest observation rank 1, the second smallest rank 2 and
the last one rank $n_1+n_2$. If two or more observations have the same magnitude, then
give the average rank to each. For example, if there are two smallest numbers then
each of these numbers gets the rank $\frac{1+2}{2} = 1.5$ and the next number gets the rank 3,
and so on. Keep track of the ranks occupied by each sample. A popular test statistic
based on ranks is the Mann-Whitney $u$-test where

$$u = n_1n_2+\frac{n_1(n_1+1)}{2}-R_1 \tag{13.37}$$

where $n_1$ and $n_2$ are the sample sizes and $R_1$ is the sum of ranks occupied by the sample
with size $n_1$. Under the hypothesis of identical distributions for $x$ and $y$, it can
be shown that the mean value and the variance are $E(u) = \frac{n_1n_2}{2}$ and $\operatorname{Var}(u) = \sigma_u^2 =
\frac{n_1n_2(n_1+n_2+1)}{12}$ and that

$$v = \frac{u-E(u)}{\sigma_u} \approx N(0,1) \tag{13.38}$$

or $v$ is approximately a standard normal, and a good approximation is available for
$n_1 \ge 8$, $n_2 \ge 8$. Hence we reject the hypothesis if the observed value of $|v| \ge z_{\alpha/2}$ [see
Figure 13.5].


Example 13.15. The following are the observations on waiting times, $x$, at one checkout
counter in a department store on 10 randomly selected occasions: 10, 5, 18, 12, 3, 8,
5, 8, 9, 12. The following are the waiting times, $y$, on 8 randomly selected occasions at
another checkout counter of the same store: 2, 5, 3, 4, 6, 5, 6, 9. Test the hypothesis, at
a 5% level of rejection, that $x$ and $y$ are identically distributed.


Solution 13.15. Since the size of the second sample is smaller, we will take that as
the one for computing the sum of the ranks. Then in our notation, $n_1 = 8$, $n_2 = 10$.
Let us pool and order the numbers. A subscript $a$ is put for numbers coming from the
sample with size $n_1$ for identification. The following table gives the numbers and the
corresponding ranks:


Numbers   $2_a$   3     $3_a$     $4_a$   5     5     $5_a$     $5_a$     $6_a$
Ranks     $1_a$   2.5   $2.5_a$   $4_a$   6.5   6.5   $6.5_a$   $6.5_a$   $9.5_a$


Numbers   $6_a$     8      8      9      $9_a$      10   12     12     18
Ranks     $9.5_a$   11.5   11.5   13.5   $13.5_a$   15   16.5   16.5   18


The sum of the ranks occupied by the sample of size $n_1$, and the observed values of
other quantities are the following:

$$R_1 = 1.0+2.5+4.0+6.5+6.5+9.5+9.5+13.5 = 53.0;$$

$$E(u) = \frac{(8)(10)}{2} = 40;\qquad \sigma_u^2 = \frac{(8)(10)(8+10+1)}{12} = 126.67,\quad \sigma_u = 11.25;$$

$$u = (8)(10)+\frac{(8)(9)}{2}-53.0 = 63.0;\qquad v = \frac{u-E(u)}{\sigma_u} = \frac{63-40}{11.25} = 2.04.$$

At the 5% level of rejection in a standard normal case, the critical point is 1.96. Since the
observed value $|v| = 2.04$ exceeds 1.96, we reject the hypothesis here.
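
The ranking with averaged ranks for ties is easy to get wrong by hand, so the following Python sketch tallies the ranks directly from the two data lists of Example 13.15 and then evaluates (13.37) and (13.38). The helper `midranks` and the variable names are hypothetical; no correction for ties is applied to the variance, in line with the formula for $\sigma_u^2$ given in the text.

```python
import math


def midranks(values):
    """Ranks 1..len(values), with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + 1 + j + 1) / 2.0          # average of positions i+1, ..., j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


x = [10, 5, 18, 12, 3, 8, 5, 8, 9, 12]           # first checkout counter
y = [2, 5, 3, 4, 6, 5, 6, 9]                     # second checkout counter (smaller sample)

n1, n2 = len(y), len(x)                          # n1 refers to the smaller sample
ranks = midranks(y + x)                          # the first n1 ranks belong to y
r1 = sum(ranks[:n1])

u = n1 * n2 + n1 * (n1 + 1) / 2 - r1             # eq. (13.37)
mean_u = n1 * n2 / 2
sigma_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
v = (u - mean_u) / sigma_u                       # eq. (13.38)

print("R1 =", r1, "u =", u, "E(u) =", mean_u, "sigma_u =", round(sigma_u, 2))
print("v =", round(v, 2), "reject H0 at 5%:", abs(v) >= 1.96)
```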




Note 13.16. Again, non-rejection of the hypothesis of identical distribution cannot
be given meaningful interpretation. The same observations could have been obtained
if the populations were not identically distributed. Observed values of $u$ and
$R_1$ or the formula in (13.38) do not characterize the property of identical distributions
for the underlying distributions. Hence, in this test as well as in the other tests
to follow, non-rejection of the hypothesis should not be given all sorts of interpretations.


Another test based on the rank sums, which can be used in testing the hypothesis
that k given populations are identical, is the Kruskal-Wallis H-test. Suppose that the
samples are of sizes $n_i$, $i = 1, \ldots, k$. Again, pool the samples and order the
observations from the smallest to the largest. Assign ranks from 1 to
$n = n_1 + \cdots + n_k$, distributing the averages of the ranks when some observations
are repeated. Let $R_i$ be the sum of the ranks occupied by the i-th sample. Then

$$H = \frac{12}{n(n+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(n+1) \approx \chi^2_{k-1} \tag{13.39}$$

where $n = n_1 + \cdots + n_k$ and H is approximately a chi-square with k - 1 degrees of
freedom under the null hypothesis that all the k populations are identical. The
approximation is good for $n_i \ge 5$, $i = 1, \ldots, k$. Here, we reject the hypothesis
for large values of H only.
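
As an illustration of (13.39), the statistic H can be computed in a few lines. This is a
sketch only: the function names and the small data set are made up for the illustration,
tied observations receive averaged ranks as described above, and no further adjustment
for ties is applied.

```python
def midranks(values):
    """Ranks 1..n of the values; tied values get the average of their ranks."""
    positions = {}
    for pos, val in enumerate(sorted(values), start=1):
        first, _ = positions.get(val, (pos, pos))
        positions[val] = (first, pos)
    return [(positions[v][0] + positions[v][1]) / 2 for v in values]


def kruskal_wallis_h(samples):
    """H = 12/(n(n+1)) * sum(R_i^2 / n_i) - 3(n+1), as in (13.39)."""
    pooled = [v for sample in samples for v in sample]
    ranks = midranks(pooled)
    n = len(pooled)
    total, start = 0.0, 0
    for sample in samples:
        n_i = len(sample)
        r_i = sum(ranks[start:start + n_i])  # rank sum of the i-th sample
        total += r_i ** 2 / n_i
        start += n_i
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


# Hypothetical, fully separated samples: rank sums 6, 15 and 24.
h = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(round(h, 4))
```

The observed H is compared with the upper percentage point of a chi-square with k - 1
degrees of freedom; the toy samples here are smaller than the sizes for which the
approximation is recommended and serve only to show the arithmetic.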


13.9.6 The run test 


This test is usually used for testing randomness of an event. Suppose that in a
production queue, items are produced in succession. If an item satisfies the quality
specifications, then we call it a good item, denoted by G, and if the item does not satisfy
the quality specifications then we call it defective, denoted by D. Hence the production
queue can be described by a chain of the letters G and D, such as GGGDGGGDDGG,
etc. A succession of identical symbols is called a run. In our example, there is one run
of size 3 of G, then one run of size 1 of D, then one run of size 3 of G, then one run
of size 2 of D, then one run of size 2 of G. Thus there are 5 runs here. The number of
times the symbol G appears is $n_1 = 8$ and the number of times the symbol D appears
is $n_2 = 3$ here. Consider a sequence of two symbols, such as G and D, where the first
symbol appears $n_1$ times and the second symbol appears $n_2$ times. [Any one of the two
symbols can be called the first symbol and the other the second symbol.] Let the total
number of runs in the sequence be R. Then under the hypothesis that the first symbol
(or the second symbol) appears in the sequence at random, that is, its occurrence
does not follow any particular pattern, we can compute the expected value
and variance of R. Then it can be shown that the standardized R is approximately a
standard normal for large $n_1$ and $n_2$. The various quantities, under the hypothesis of
randomness, are the following:

$$E(R) = \frac{2 n_1 n_2}{n_1 + n_2} + 1, \qquad \sigma_R^2 = \frac{2 n_1 n_2 (2 n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2 (n_1 + n_2 - 1)},$$

$$T = \frac{R - E(R)}{\sigma_R} \approx N(0,1) \quad \text{for } n_1 \ge 10,\ n_2 \ge 10. \tag{13.40}$$

The ideas will be clear from the following example. Note that we reject the null
hypothesis of randomness if $|T| \ge z_{\alpha/2}$ for a test at the level of rejection $\alpha$.


Example 13.16. Test the hypothesis that in a production process, where an item can 
be D = defective and G = good, the defective items or D’s are appearing at random, 
based on the following observed sequence of D’s and G’s, at the 5% level of rejection: 
Observed sequence 


GGGDDGGDGGGGGDDDGGGGDDGDDD 


Solution 13.16. Let $n_1$ be the number of D's and $n_2$ the number of G's in the given
sequence. Here, $n_1 = 11$, $n_2 = 15$. The number of runs R = 10. Under the hypothesis of
randomness of D, the observed values are the following:

$$E(R) = \frac{2(11)(15)}{11 + 15} + 1 = 13.69; \qquad \sigma_R^2 = \frac{2(11)(15)(2(11)(15) - 11 - 15)}{(11 + 15)^2 (11 + 15 - 1)} = 5.94;$$

$$|T| = \left|\frac{R - E(R)}{\sigma_R}\right| = \left|\frac{10 - 13.69}{\sqrt{5.94}}\right| = 1.51.$$

Here, $\alpha = 0.05$ or $\frac{\alpha}{2} = 0.025$ and then $z_{0.025} = 1.96$ from standard
normal tables. The observed value of |T| is 1.51, which is not greater than 1.96, and
hence the hypothesis is not rejected.
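
Counting runs by hand is error-prone for long sequences, so a short script is a convenient
check. The sketch below is illustrative only (the function name `run_test` is made up); it
counts the runs with `itertools.groupby` and evaluates the quantities in (13.40). Because
it does not round intermediate values, the last digit of |T| may differ slightly from a
hand computation based on rounded values.

```python
from itertools import groupby
from math import sqrt


def run_test(sequence, first_symbol):
    """Return (n1, n2, R, T) for a two-symbol sequence, following (13.40)."""
    n1 = sum(1 for s in sequence if s == first_symbol)
    n2 = len(sequence) - n1
    runs = sum(1 for _ in groupby(sequence))  # one group per run
    mean_r = 2 * n1 * n2 / (n1 + n2) + 1
    var_r = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
             / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    return n1, n2, runs, (runs - mean_r) / sqrt(var_r)


# Observed sequence of Example 13.16, with D taken as the first symbol.
seq = "GGGDDGGDGGGGGDDDGGGGDDGDDD"
n1, n2, runs, t = run_test(seq, "D")
print(n1, n2, runs, round(abs(t), 3))
```

The hypothesis of randomness is rejected at the 5% level only if the printed |T| exceeds
1.96.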


Exercises 13.9 


13.9.1. A bird watcher reported that she has spotted birds belonging to 6 categories in
a particular bird sanctuary and her claim is that these categories of birds are
frequenting this sanctuary in the proportions 1 : 1 : 2 : 3 : 1 : 2.
Test this claim, at a 5% level of rejection, if the following data are available:


Category 1 2 3 4 5 6 
Frequency 6 7 13 17 6 10 


13.9.2. The numbers of telephone calls received by a switchboard in successive 5-minute
intervals are given in the following table:


Number of calls = x                0   1   2   3   4   5   6
Number of intervals (frequency)   25  30  20  15  10   8  10


Test the hypothesis that the data is compatible with the assumption that the telephone 
calls are Poisson distributed. Test at a 5% level of rejection. 




13.9.3. The following table gives the increase in sales in a particular shop after placing
advertisements. Test the “goodness-of-fit” of an exponential distribution to this data,
at a 5% level of rejection.


Increase    5-10  11-15  16-22  23-27  28-32  ≥33
Frequency   200   100    170    140    100    25


[Note: Make the intervals continuous and take the mid-points as representative values, 
when estimating the mean value.] 


13.9.4. The following contingency table gives the frequencies of people classified
according to their mood and intelligence. Test, at a 1% level of rejection, to see whether
there is any association between these two characteristics of classification, where
I = intelligent, II = average, III = below average.


Mood \ Intelligence    I    II   III
Good                   15   10   10
Tolerable               8   10   10
Intolerable             8   10   15


13.9.5. The following table gives the telephone calls received by a switchboard on 256
days. Test whether or not a Poisson distribution with parameter $\lambda = 2$ is a good fit,
by using the Kolmogorov-Smirnov test $D_n$. [The tabulated value of $D_{256}$ = 0.085 at 5%
level of rejection.]


x           0   1   2   3   4   5   6   7   8
Frequency  52  68  60  40  22  10   3   1   0

13.9.6. The following table gives the waist measurements of 35 girls. Test the
goodness-of-fit of a $N(\mu = 15, \sigma^2 = 4)$ to the data by using the $D_n$ statistic.
[The tabulated value of $D_{35}$ = 0.27 at 0.01 level of rejection.]


x           10  12  13  14  15  17  18
Frequency    2   4   6  10   7   4   2


13.9.7. The following are the observations on the grades obtained by 14 students in
a particular course. Assume that the typical grade x has a symmetric distribution
around $E(x) = \mu$. Test the hypothesis at a 5% level of rejection that $\mu = 80$, against
$\mu \ne 80$. Data: 90, 100, 60, 40, 80, 60, 50, 40, 55, 62, 80, 80, 30, 80.


13.9.8. The following are the grades of 10 randomly selected students under method
1 of teaching: 90, 95, 80, 80, 85, 70, 73, 82, 83, 80 and the following are the grades of 8
randomly selected students under method 2 of teaching: 40, 50, 55, 60, 65, 45, 70, 100.
If the students have the same background, test the hypothesis that both the methods
are equally effective, by using a u-test at a 5% level of rejection.




13.9.9. Use a Kruskal-Wallis H-test to test at a 5% level of rejection that the following
three populations, designated by the random variables $x_1$, $x_2$ and $x_3$, are identical,
based on the following data:


$x_1$ : 5, 2, 5, 6, 8, 10, 12, 11, 10;
$x_2$ : 15, 16, 2, 8, 10, 14, 15, 15, 18;
$x_3$ : 20, 18, 30, 15, 10, 11, 8, 15, 18, 12.


13.9.10. Test the hypothesis of randomness of the occurrence of the symbol D in a
production process, based on the following observed sequence [test at a 1% level of
rejection]:


GGDDDGDGGGDDGGGGDGGDDGGGGGDGGGGD. 


14 Model building and regression 


14.1 Introduction 


There are various types of models that one can construct for a given set of data. The
type of model that is chosen depends upon the type of data for which the model is to
be constructed. If the data are coming from a deterministic situation, then there may
already be an underlying mathematical formula such as a physical law. Perhaps the
law may not be known yet. When the physical law is known, then there is no need
to fit a model, but for verification purposes, one may substitute data points into the
speculated physical law. For example, a simple physical law for gases says that the
pressure P multiplied by volume V is a constant under a constant temperature. Then
the physical law that is available is

$$PV = c$$

where c is a constant. When it is a mathematical relationship, then all pairs of
observations on P and V must satisfy the equation PV = c. If $V_1$ is one observation on V
and if the corresponding observation is $P_1$ for P, then $P_1 V_1 = c$ for the same constant
c. Similarly, other pairs of observations $(P_2, V_2), \ldots$ will satisfy the equation PV = c. If
there are observational errors in observing P and V, then the equation may not be
satisfied exactly by a given observational pair. If the proposed model PV = c is not
true exactly, then of course the observational pairs $(P_i, V_i)$ for some specific i may not
satisfy the equation PV = c. There are many methods of handling deterministic
situations. The usual tools are differential equations, difference equations, functional
equations, integral equations, etc. For more details on deterministic situations and
the corresponding model, see the CMS publication of 2010 SERC Notes.
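
Substituting data points into a speculated law of the form PV = c amounts to checking
whether the products $P_i V_i$ agree up to observational error. The following sketch shows
one way this might be done; the observations, the units and the 2% tolerance are all
hypothetical choices made only for the illustration.

```python
def check_constant_product(pairs, rel_tol=0.02):
    """Return (ok, c_hat): whether every P*V lies within rel_tol of the average."""
    products = [p * v for p, v in pairs]
    c_hat = sum(products) / len(products)  # average product as the estimate of c
    ok = all(abs(c - c_hat) <= rel_tol * c_hat for c in products)
    return ok, c_hat


# Hypothetical (P, V) observations with small observational errors.
observations = [(100.0, 2.0), (200.0, 1.01), (400.0, 0.495)]
ok, c_hat = check_constant_product(observations)
print(ok, c_hat)
```

A failure of the check does not say whether the law or the measurements are at fault; it
only indicates that the pairs do not satisfy PV = c to within the chosen tolerance.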


14.2 Non-deterministic models 


A deterministic situation is governed by definite mathematical rules. There is no chance
variation involved. But most practical situations, especially in social sciences,
economics, commerce, management, etc., as well as many physical phenomena, are
non-deterministic in nature. An earthquake at a place cannot be predetermined, but with
sophisticated prediction tools we may be able to predict the occurrence to some
extent. We know that the Meenachil River will be flooded during the monsoon season but
we cannot tell in advance what the flood level will be on July 1, 2020, in front of the
Pastoral Institute. Even though many factors about weight gain are known, we cannot state
for sure how much weight a cow will gain under a certain feed. A student who
is going to write an examination cannot tell beforehand what exact grade she is
going to get. She may be able to predict that it will be above 80% from her knowledge
about the subject matter. But after writing the examination she might be able to give a
better prediction that she would get at least 95% if not 100%. She could improve her
prediction by using the additional information gained from having written the examination.

Open Access. © 2018 Arak M. Mathai, Hans J. Haubold, published by De Gruyter. This work is
licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
https://doi.org/10.1515/9783110562545-014

The situations described above, and many other situations of the same type, are not
deterministic in nature. When chance variation is involved, prediction can be made by
using properties of random variables or chance variables or measurement of chance
or probabilities.

Since a lot of practical situations are random or non-deterministic in nature, when
we talk about model building, people naturally think that we are trying to describe
a random or non-deterministic situation by mathematical modeling. Attempts to
describe random situations have given birth to different branches of science. Stochastic
processes form one such area, where we study a collection of random variables. Time series
analysis is an area where we study a collection of random variables over time.
Regression is an area where we try to describe random situations by analyzing conditional
expectations. As can be seen, even a basic description of all the areas and
disciplines where we build models would take hundreds of pages. Hence what we will do
here is to pick a few selected topics and give a basic introduction to these topics.


14.2.1 Random walk model 


Consider a simple practical situation of a drunkard coming out of the liquor bar,
denoted by S in Figure 14.1.


[Figure 14.1: Random walk on a line. Marked points: S, H, B.]


He tries to walk home. Since he is fully drunk, assume that he walks in the following
manner. At every minute, he will either take a step forward or backward. Let us assume
a straight line path. Suppose that he covers 1 foot (about a third of a meter) at each
step. He came out from the bar as indicated by the arrow. If his first step is to the
right, then he is one foot closer to home, whereas if his first step is to the left, then he
is farther away from home by one foot. At the next minute, he takes the next step either
forward or backward. If the first step was backward and the second step is also backward,
then he is now farther away from home by two feet. One can associate a chance or
probability for taking a step to the left (backward) or right (forward) at each stage. If
the probabilities are $\frac{1}{2}$ each, then at each step there is a 50% chance that the
step will be forward and a 50% chance that the step will be backward. If the probabilities
of going forward and backward are 0.6 and 0.4, respectively, then there is a 60% chance
that his first step will be forward.




Some interesting questions to ask in this case are the following: What is the 
chance that eventually he will reach home? What is the chance that eventually he 
will get lost and walk away from home to infinity? Where is his position after n steps? 
There can be several types of modifications to the simple random walk on a line. In 
Figure 14.1, a point is marked as B. This B may be a barrier. This barrier may be such 
that once he hits the barrier he falls down dead or the walk is finished, or the barrier 
may be such that if he hits the barrier there is a certain positive chance of bouncing 
back to the previous position so that the random walk can continue and there may be 
a certain chance that the walk is finished, and so on. 
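
The walk described above is easy to simulate. The sketch below is a minimal illustration
under stated assumptions: the position is measured in feet from the bar with the forward
direction taken as positive, each step is forward with probability `p_forward`, and the
optional `barrier` is treated as absorbing, so the walk ends the first time it is reached.
The function name, the barrier position and the seed are made up for the example.

```python
import random


def random_walk(n_steps, p_forward=0.5, barrier=None, rng=None):
    """Return the list of positions visited, starting from 0 at the bar."""
    rng = rng or random.Random()
    position = 0
    path = [position]
    for _ in range(n_steps):
        # One step per minute: +1 foot forward, -1 foot backward.
        position += 1 if rng.random() < p_forward else -1
        path.append(position)
        if barrier is not None and position == barrier:
            break  # absorbing barrier: the walk is finished
    return path


# Hypothetical run: 60% chance of a forward step, absorbing barrier 5 feet behind the bar.
path = random_walk(100, p_forward=0.6, barrier=-5, rng=random.Random(14))
print(len(path) - 1, path[-1])
```

Repeating such runs many times gives empirical answers to the questions raised above,
such as the proportion of walks that hit the barrier or the typical position after n steps.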

An example of a 2-dimensional random walk is the case of Mexican jumping
beans. There is a certain variety of Mexican beans (lentils). If you place a dried bean
on the tabletop, then after a few seconds it jumps by itself in a random direction to
another point on the table. This is due to an insect making the dry bean its home;
the insect moves around by jumping. We can also consider a random walk in space, such
as a dust particle moving around in the atmosphere, and random walks in higher dimensions.


14.2.2 Branching process model 


In nature, there are many species which behave in the following manner. There is a
mother and the mother gives birth to offspring once in her lifetime. After giving
birth, the mother dies. The number of offspring could be 0, 1, 2, ..., k where k is a
finite number, not infinitely large. A typical example is the banana plant. One banana
plant gives only one bunch of bananas. You cut the bunch and the mother plant is
destroyed. The next generation offspring are the new shoots coming from the bottom.
The number of shoots could be 0, 1, 2, 3, 4, 5, usually a maximum of 5 shoots. These
shoots are the next generation plants. Each shoot, when mature, can produce one
bunch of bananas. Usually, after cutting the mother banana plant, the farmer
will separate the shoots and plant all shoots, except one, elsewhere so that all have
good chances of growing up into healthy banana plants.

Another example is the pineapple. Again one pineapple plant gives only one
pineapple. The pineapple itself will have one shoot at the top of the fruit and
other shoots will be coming from the bottom. Again the mother plant is destroyed
when the pineapple is plucked. Another example is certain species of spiders. The
mother carries the sack of eggs around and dies after the new offspring are born.
Another example is salmon. From the wide ocean, the mother salmon enters
a fresh water river, goes to the birthplace of the river, overcoming all types of obstacles
on the way, lays one bunch of eggs and then promptly dies. Young salmon come
out of these eggs and they find their way down river to the ocean. The life cycle is
continued.




Assume that the last name of a person, for example, “Rumfeld”, is carried only by
the sons and not by the daughters. It is assumed that the daughters take the last names
of their husbands. Then there is a chance that the father's last name will be extinct
after some generations. What is the chance that the name Rumfeld will disappear from
the Earth?

These are examples of branching processes. Interesting questions to
ask are the following: What is the chance that the species will be extinct eventually?
This can happen if there is a positive probability of having no offspring in a given birth.
What is the expected population size after n generations? The branching process is a
subject matter by itself and it is a special case of a general class of processes known as
birth and death processes.
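
Both questions can be explored by simulation. The sketch below assumes the simplest
setting, which is an assumption of this illustration and not something specified above:
every individual independently leaves k offspring with probability `offspring_probs[k]`,
and the process starts from one ancestor. In that setting the expected size of the n-th
generation is $m^n$, where m is the mean number of offspring. The offspring distribution
(0.2, 0.3, 0.5) and all function names are made up for the example.

```python
import random


def simulate_generations(offspring_probs, n_generations, rng):
    """Generation sizes [1, Z_1, ..., Z_n] of one simulated family line."""
    counts = range(len(offspring_probs))
    sizes = [1]
    for _ in range(n_generations):
        parents = sizes[-1]
        children = (sum(rng.choices(counts, weights=offspring_probs, k=parents))
                    if parents else 0)
        sizes.append(children)
    return sizes


def expected_size(offspring_probs, n_generations):
    """m**n, where m is the mean number of offspring per individual."""
    m = sum(k * p for k, p in enumerate(offspring_probs))
    return m ** n_generations


def extinction_fraction(offspring_probs, n_generations, n_trials, rng):
    """Fraction of simulated lines that have died out by generation n."""
    extinct = sum(simulate_generations(offspring_probs, n_generations, rng)[-1] == 0
                  for _ in range(n_trials))
    return extinct / n_trials


probs = [0.2, 0.3, 0.5]  # hypothetical chances of 0, 1 or 2 offspring
rng = random.Random(2018)
frac = extinction_fraction(probs, 15, 2000, rng)
print(expected_size(probs, 15), frac)
```

For this made-up distribution the eventual extinction probability is the smallest
non-negative root of $q = 0.2 + 0.3q + 0.5q^2$, namely q = 0.4, so the simulated fraction
should settle near that value.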


14.2.3 Birth and death process model 


This can be explained with a simple example. Suppose that there is a good pool area
in a river, a good fishing spot for fishermen. Fish move in and move out of the pool
area. If N(t) is the number of fish in the pool area at time t and if one fish moves out at
the next unit of time, then N(t + 1) = N(t) - 1. On the other hand, if one fish moves into
the pool area at the next time unit, then N(t + 1) = N(t) + 1. When there is one addition,
we can say that there is one birth, and when there is one deletion, we can say that
there is one death. Thus if we are modeling such a process where there is a possibility
of birth and death, then we call it a birth and death model.
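
The recursion N(t + 1) = N(t) ± 1 can be turned into a small simulation. In the sketch
below, the constant per-step probabilities `p_birth` and `p_death`, the possibility that
nothing happens in a time unit, and the rule that no fish can leave an empty pool are
assumptions made only for this illustration.

```python
import random


def birth_death_path(n0, n_steps, p_birth, p_death, rng):
    """Simulate N(0), N(1), ..., N(n_steps) with at most one event per time unit."""
    if p_birth + p_death > 1:
        raise ValueError("p_birth + p_death must not exceed 1")
    n = n0
    path = [n]
    for _ in range(n_steps):
        u = rng.random()
        if u < p_birth:
            n += 1  # one fish moves in: a birth
        elif u < p_birth + p_death and n > 0:
            n -= 1  # one fish moves out: a death
        path.append(n)
    return path


# Hypothetical pool starting with 5 fish.
path = birth_death_path(5, 50, p_birth=0.3, p_death=0.3, rng=random.Random(7))
print(path[-1])
```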


14.2.4 Time series models 


Suppose that we are monitoring the flood level of the Meenachil River at the Pastoral
Institute. If x(t) denotes the flood level on the t-th day, time t being measured in days,
then at the zeroth day or start of the observation period the flood level is x(0), the
next day the flood level is x(1), and so on. We are observing a phenomenon, namely the
flood level, over time. In this example, time is measured in discrete time units. Details
of various types of processes and details of the various techniques available for time
series modeling of data are available in SERC School Notes of 2005-2010. Since these
are available to the students, the material will not be elaborated here. But some more
details will be given in the coming chapters.


14.3 Regression type models 


The first prediction problem that we are going to handle is associated with a concept
called regression, and our models will be regression-type models. When Sreelakshmy
was born, her doctor predicted, by looking at the heights of her parents and grandparents,
that she would be 5'5" at the age of 11. When she reached 11 years of age,
her height was only 5'2". Thus the true or observed height was 5'2" against the
predicted height of 5'5". Thus the prediction was not correct and the error in the
prediction = observed minus predicted = 5'2" - 5'5" = -3". Thus the prediction was off by
3" in magnitude. [We could have also described error as predicted minus observed.]
Of course, the guiding principle is that the smaller the distance between the observed and
predicted values, the better the prediction. Thus we will need to consider some measures
of distance between the observed and predicted. When random variables are involved, some
measures of distance between the real scalar random variable x and a fixed point a
are the following:


$$E|x - a| = \text{mean deviation or expected difference between } x \text{ and } a \tag{i}$$

$$[E|x - a|^2]^{\frac{1}{2}} = \text{mean square deviation between } x \text{ and } a \tag{ii}$$

$$[E|x - a|^r]^{\frac{1}{r}}, \quad r = 1, 2, 3, \ldots \tag{iii}$$

where E denotes the expected value. Many such measures can be proposed. Since we
are going to deal mainly with the mean deviation and the mean square deviation, we
will not look into other measures of distance between x and a here. For the sake of
curiosity, let us see what should be a, if a is an arbitrary constant, such that $E|x - a|$
is a minimum, or what should be an arbitrary constant b such that
$[E|x - b|^2]^{\frac{1}{2}}$ is a minimum?


14.3.1 Minimization of distance measures 


Definition 14.1 (A measure of scatter). A measure of scatter in a real scalar random
variable x from an arbitrary point a is given by $\sqrt{E(x - a)^2}$.


What should be a such that this dispersion is the least? Note that minimization of
$\sqrt{E(x - a)^2}$ is equivalent to minimizing $E(x - a)^2$. But

$$\begin{aligned}
E(x - a)^2 &= E(x - E(x) + E(x) - a)^2 \quad \text{by adding and subtracting } E(x) \\
&= E(x - E(x))^2 + E(E(x) - a)^2 + 2E[(x - E(x))(E(x) - a)] \\
&= E(x - E(x))^2 + (E(x) - a)^2
\end{aligned} \tag{14.1}$$

because the cross product term is zero due to the fact that (E(x) - a) is a constant and,
therefore, the expected value applies only on (x - E(x)), that is,

$$E(x - E(x)) = E(x) - E(E(x)) = E(x) - E(x) = 0$$

since E(x) is a constant. In the above computations, we assumed that E(x) is finite or it
exists. In (14.1), the only quantity containing a is $[E(x) - a]^2$ where both the quantities
E(x) and a are constants, and hence (14.1) attains its minimum when the non-negative
quantity $[E(x) - a]^2$ attains its minimum which is zero. Therefore,

$$[E(x) - a]^2 = 0 \Rightarrow E(x) - a = 0 \Rightarrow E(x) = a.$$

Hence the minimum is attained when a = E(x). [The maximum value that $E(x - a)^2$ can
attain is $+\infty$ because a is arbitrary.] We will state this as a result.


Result 14.1. For a real scalar random variable x, for which E(x) exists, that is, E(x) is
a fixed finite quantity, and for an arbitrary real number a,

$$\min_a [E(x - a)^2]^{\frac{1}{2}} \Rightarrow \min_a E(x - a)^2 \Rightarrow a = E(x). \tag{14.2}$$


In a similar fashion, we can show that the mean deviation is least when the
deviation is taken from the median. This can be stated as a result.


Result 14.2. For a real scalar random variable x, having a finite median, and for an
arbitrary real number b,

$$\min_b E|x - b| \Rightarrow b = M$$

where M is the median of x.


The median M is the middle value for x in the sense that

$$\Pr\{x \le M\} \ge \frac{1}{2} \quad \text{and} \quad \Pr\{x \ge M\} \ge \frac{1}{2}$$

where Pr{·} denotes the probability of the event {·}. The median M can be unique or
there may be many points qualifying to be the median for a given x depending upon
the distribution of x.
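
Results 14.1 and 14.2 can be illustrated numerically. The sketch below uses a small
discrete distribution that is made up purely for illustration, evaluates $E(x - a)^2$ and
$E|x - b|$ on a grid of candidate points, and locates the minimizing points. For this
distribution the mean is 2.6 and the median is 2.

```python
# Hypothetical discrete distribution: values of x and their probabilities.
values = [1, 2, 3, 10]
probs = [0.4, 0.3, 0.2, 0.1]


def expect(g):
    """Expected value of g(x) under the distribution above."""
    return sum(p * g(v) for v, p in zip(values, probs))


mean = expect(lambda v: v)
variance = expect(lambda v: (v - mean) ** 2)


def mean_square(a):
    return expect(lambda v: (v - a) ** 2)


def mean_abs(b):
    return expect(lambda v: abs(v - b))


grid = [i / 100 for i in range(0, 1101)]  # candidate points 0.00, 0.01, ..., 11.00
best_a = min(grid, key=mean_square)       # minimizer of E(x - a)^2 on the grid
best_b = min(grid, key=mean_abs)          # minimizer of E|x - b| on the grid
print(mean, best_a, best_b)
```

The decomposition (14.1) can be checked with the same helpers: for any a, `mean_square(a)`
agrees with `variance + (mean - a) ** 2` up to rounding error.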


14.3.2 Minimum mean square prediction 


First, we will consider the problem of predicting one real scalar random variable by
using one real scalar random variable. Such situations are plentiful in nature. Let y be
the variable to be predicted and let x be the variable by using which the variable y is
predicted. Sometimes the variable y to be predicted is called the dependent variable and
the variable x, which is independently preassigned to predict y, is called the
independent variable. This terminology should not be confused with statistical independence
of random variables. Let y be the marks obtained by a student in a class test and x be
the amount of study time spent on that subject. The type of question that we would
like to ask is the following: Is y a function of x? If so, what is the functional
relationship, so that we can use it to evaluate y at a given value of x? If there is no obvious
functional relationship, can we use a preassigned value of x to predict y? Can we
answer a question such as: if 20 hours of study time is used, what will be the grade in the
class test? In this case, irrespective of whether there exists a relationship between x
and y or not, we would like to use x to predict y.

As another example, let y be the growth of a plant seedling (measured in terms of
height) in one week, against the amount of water x supplied. As another example, let
y be the amount of evaporation of a certain liquid in one hour and x be the total exposed
surface area.

If x is a variable that we can preassign or observe, what is a prediction function
of x in order to predict y? Let $\phi(x)$ be an arbitrary function of x that we want to use
as a predictor of y. We may want to answer questions like: what is the predicted value
of y if the function $\phi(x)$ is used to predict y at $x = x_0$, where $x_0$ is a given point?
Note that infinitely many functions can be used as a predictor for y. Naturally, the
predicted value of y at $x = x_0$ will be far off from the true value if $\phi(x)$ is not a good
predictor for y. What is the “best” predictor, “best” in some sense? If y is predicted by
using $\phi(x)$, then the ideal situation is that $\phi(x)$, at every given value of x, coincides
with the corresponding observed value of y. This is the situation of a mathematical
relationship, which may not be available in a problem in social sciences, physical sciences
and natural sciences. For the example of the student studying for the examination, if a
specific function $\phi(x)$ is there, where x is the number of hours of study, then when
x = 3 hours, $\phi(x)|_{x=3} = \phi(3)$ should produce the actual grade obtained by the student
by spending 3 hours of study. Then $\phi(x)$ should give the correct observation for every
given value of x. Naturally this does not happen. Then the error in using $\phi(x)$ to
predict the value of y at a given x is

$$y - \phi(x) \quad \text{or we may take it as} \quad \phi(x) - y.$$


The aim is to minimize a “distance” between y and $\phi(x)$ and thus construct $\phi$. Then
this $\phi$ will be a “good” $\phi$. This is the answer from common sense. We have many
mathematical measures of “distance” between y and $\phi(x)$ or measures of scatter in
$e = y - \phi(x)$. One such measure is $\sqrt{E[y - \phi(x)]^2}$ and another measure is
$E|y - \phi(x)|$, where y is a real scalar random variable and x is a preassigned value of
another random variable or, more precisely, for the first measure of scatter the distance
is $\sqrt{E[y - \phi(x = x_0)]^2}$, that is, at $x = x_0$. Now, if we take this measure then the
problem is to minimize over all possible functions $\phi$:

$$\min_{\phi} \sqrt{E[y - \phi(x = x_0)]^2} \Rightarrow \min_{\phi} E[y - \phi(x = x_0)]^2.$$


But from (14.2) it is clear that the “best” function $\phi$, best in the minimum mean square
sense, that is, minimizing the expected value or mean value of the squared error, is
$\phi = E(y|x = x_0)$ or simply $\phi = E(y|x)$ = conditional expectation of y given x. Hence
this “best predictor”, best in the minimum mean square sense, is defined as the regression
of y on x. Note that if we had taken any other measure of scatter in the error e we
would have ended up with some other function for the best $\phi$. Hereafter, when we say
“regression”, we will mean the best predictor, best in the minimum mean square sense,
or the conditional expectation of y given x. Naturally, for computing E(y|x) we should
either have the joint distribution of x and y or at least the conditional distribution of y,
given x.


Definition 14.2 (Regression of y on x). It is defined as E(y|x) = conditional
expectation of y given x whenever it exists.


Note 14.1. There is quite a lot of misuse in this area of “regression”. In some
applied statistics books, the concept of regression is mixed up with the least square
estimates. Hence the students must pay special attention to the basic concepts and
the logic of the derivations here so that the topic is not mixed up with the least square
estimation problem. Regression has practically very little to do with least square
estimation.


Example 14.1. If x and y have a joint distribution given by the following density
function, where both x and y are normalized to the interval [0,1],

$$f(x,y) = \begin{cases} x + y, & 0 \le x \le 1,\ 0 \le y \le 1 \\ 0, & \text{elsewhere,} \end{cases}$$

what is the “best” predictor of y based on x, best in the minimum mean square sense?
Also predict y at (i) $x = \frac{1}{3}$, (ii) x = 1.5.


Solution 14.1. As per the criterion of “best”, we are asked to compute the regression
of y on x or E(y|x). From elementary calculus, the marginal density of x, denoted by
$f_1(x)$, is given by

$$f_1(x) = \int_y f(x,y)\,dy = \begin{cases} \int_0^1 (x + y)\,dy = x + \frac{1}{2}, & 0 \le x \le 1 \\ 0, & \text{elsewhere.} \end{cases}$$

Then the conditional density of y given x is

$$g(y|x) = \frac{f(x,y)}{f_1(x)} = \frac{x + y}{x + \frac{1}{2}}, \quad 0 \le y \le 1,$$

and zero elsewhere. Hence the conditional expectation of y given x is then

$$E(y|x) = \int_y y\,g(y|x)\,dy = \frac{1}{x + \frac{1}{2}} \int_{y=0}^{1} (xy + y^2)\,dy = \frac{\frac{x}{2} + \frac{1}{3}}{x + \frac{1}{2}}.$$

Hence the best predictor of y based on x in Example 14.1 is

$$E(y|x) = \frac{\frac{x}{2} + \frac{1}{3}}{x + \frac{1}{2}}$$

for all given admissible values of x. This answers the first question. Now, to predict
y at a given x we need only substitute the value of x. Hence the predicted value at
$x = \frac{1}{3}$ is

$$\frac{\frac{1}{6} + \frac{1}{3}}{\frac{1}{3} + \frac{1}{2}} = \frac{3}{5}.$$

The predicted value of y at x = 1.5 is not available from the above formula because 1.5 is
not an admissible value of x, that is, it is outside the support $0 \le x \le 1$ of the density
g(y|x). The question contradicts what is given as the density in Example 14.1.
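
The regression function of Example 14.1 can be verified numerically by integrating the
joint density along the line $x = x_0$. The sketch below is an illustration only: the
midpoint rule, the number of subintervals and the function names are choices made for
this example, and the point $x_0 = 1/3$ is used as an illustrative admissible value.

```python
def joint_density(x, y):
    """f(x, y) = x + y on the unit square and zero elsewhere."""
    return x + y if 0 <= x <= 1 and 0 <= y <= 1 else 0.0


def integrate(g, lo, hi, n=10_000):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(g(lo + (i + 0.5) * h) for i in range(n))


def regression(x0):
    """Numerical E(y | x = x0): integral of y f(x0, y) dy divided by f_1(x0)."""
    if not 0 <= x0 <= 1:
        raise ValueError("x0 is outside the support 0 <= x <= 1")
    marginal = integrate(lambda y: joint_density(x0, y), 0, 1)
    return integrate(lambda y: y * joint_density(x0, y), 0, 1) / marginal


def closed_form(x0):
    """(x/2 + 1/3) / (x + 1/2), the regression function derived by calculus."""
    return (x0 / 2 + 1 / 3) / (x0 + 1 / 2)


print(regression(1 / 3), closed_form(1 / 3))
```

Calling `regression(1.5)` raises an error, mirroring the remark that 1.5 is not an
admissible value of x.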


Example 14.2. Suppose that it is found that x and y have a joint distribution given
by the following: [Again we will use the same notation to avoid introducing too many
symbols, even though f in Example 14.1 is different from f in Example 14.2.]

$$f(x,y) = \begin{cases} e^{-x-y}, & 0 \le x < \infty,\ 0 \le y < \infty \\ 0, & \text{elsewhere.} \end{cases}$$

Evaluate the “best” predictor of y based on x, best in the minimum mean square sense,
and predict y at $x = x_0$.


Solution 14.2. We need the conditional expectation of y given x, for which we need
the conditional density of y given x. From the above joint density, it is clear that the
marginal density of x is

$$f_1(x) = \begin{cases} e^{-x}, & 0 \le x < \infty \\ 0, & \text{elsewhere} \end{cases}$$

and hence the conditional density of y given x is given by

$$g(y|x) = \frac{f(x,y)}{f_1(x)} = \begin{cases} e^{-y}, & 0 \le y < \infty \\ 0, & \text{elsewhere} \end{cases}$$

and the conditional expectation of y given x is

$$E(y|x) = \int_0^{\infty} y\,e^{-y}\,dy = 1$$

[evaluated by using a gamma function as $\Gamma(2) = 1!$ or by integration by parts]. Here,
E(y|x) is not a function of x, which means that whatever be the preassigned value of x
the predicted value of y is simply 1. In other words, there is no meaning in using x to
predict the value of y because the conditional distribution is free of the conditioned
variable x. This happens because in this example, x and y are statistically
independently distributed. Hence x cannot be used to predict y and vice versa.
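
The conclusion of Solution 14.2 can also be seen empirically. Since the joint density
factors into two unit exponential densities, pairs (x, y) can be simulated by drawing x
and y independently. The sketch below, in which the seed, the sample size and the
intervals for x are arbitrary choices for the illustration, averages y separately for small,
medium and larger values of x; each average should be close to 1, whatever the interval.

```python
import random


def binned_means(pairs, edges):
    """Average of y over the pairs whose x falls in [edges[i], edges[i+1])."""
    sums = [0.0] * (len(edges) - 1)
    counts = [0] * (len(edges) - 1)
    for x, y in pairs:
        for i in range(len(edges) - 1):
            if edges[i] <= x < edges[i + 1]:
                sums[i] += y
                counts[i] += 1
                break
    return [s / c for s, c in zip(sums, counts)]


rng = random.Random(2018)  # arbitrary seed, for reproducibility only
pairs = [(rng.expovariate(1.0), rng.expovariate(1.0)) for _ in range(100_000)]
means = binned_means(pairs, edges=[0.0, 0.5, 1.0, 2.0])
print([round(m, 2) for m in means])
```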

But note that, in Examples 14.1 and 14.2 we have more information about the
variables x and y than what is necessary to construct the “best” predictor. The best
predictor is E(y|x), and hence we need only the conditional distribution of y given x to
predict y and we do not need the joint distribution. Knowing the joint distribution
means knowing the whole surface in a 3-dimensional space. Knowing the conditional
distribution means knowing only the shape of the curve when this surface f(x, y) is cut
by the plane $x = x_0$ for some preassigned $x_0$. [The reader is asked to look at the
geometry of the whole problem in order to understand the meaning of the above statement.]
Thus we can restrict our attention to conditional distributions only for constructing a
regression function.


Example 14.3. The strength y of an iron rod is deteriorating depending upon the
amount of rust x on the rod. More rust means less strength, and finally rust will
destroy the iron rod. The conditional distribution of y given x is seen to be of exponential
decay model with the density

$$g(y|x) = \begin{cases} \frac{1}{1+x}\,e^{-\frac{y}{1+x}}, & 0 \le y < \infty,\ 1 + x > 0 \\ 0, & \text{elsewhere.} \end{cases}$$

Construct the best predictor function for predicting the strength y at the preassigned
amount x of rust and predict the strength when x = 2 units.


Solution 14.3. From the density given above, it is clear that at every x the conditional
density of y given x is exponential with expected value 1 + x (comparing with a negative
exponential density). Hence

$$E(y|x) = 1 + x$$

is the best predictor. The predicted value of y at x = 2 is 1 + 2 = 3 units.


Example 14.4. The marks y obtained by the students in an examination are found to
be normally distributed with polynomially increasing expected value with respect to
the amount x of time spent. The conditional distribution of y, given x, is found to be
normal with the density

$$g(y|x) = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}(y - 70 - 2x - x^2)^2}, \quad -\infty < y < \infty,\ 0 \le x \le 4.$$

Construct the best predictor of y based on x and predict y at x = 3 hours.


Solution 14.4. From the density itself, it is clear that the conditional density of y
given x is $N(\mu, \sigma^2)$ with $\sigma^2 = 1$ and $\mu = 70 + 2x + x^2$ and, therefore,

\[ E(y|x) = 70 + 2x + x^2 \]

is the best predictor of y, and the predicted marks at x = 3 are

\[ 70 + 2(3) + 3^2 = 85. \]


In Examples 14.1 and 14.4, the regression function, that is, E(y|x), is a non-linear
function of x:

\[ E(y|x) = \frac{\frac{x}{2} + \frac{1}{3}}{x + \frac{1}{2}} \quad \text{in Example 14.1} \]

\[ E(y|x) = 70 + 2x + x^2 \quad \text{in Example 14.4} \]

whereas

\[ E(y|x) = 1 + x \quad \text{in Example 14.3} \]

which is a linear function of x. Thus the regression of y on x may or may not be a
linear function of x. If x and y have a joint bivariate normal distribution, then we
know for sure that E(y|x) is a linear function of x. But one should not conclude that
regression being a linear function of x means that the variables are jointly normally
distributed: in Example 14.3 the regression is linear in x but it is not a case of
joint normal distribution.


Note 14.2. The word “regression” means going back; to regress means to go back.
But in a regression-type prediction problem we are not going back to something.
We are only computing the conditional expectation. The word “regression” is used
for historical reasons. The original problem, when regression was introduced, was
to infer something about ancestors by observing offspring, and thus going back.


14.3.3 Regression on several variables 


Again, let us examine the problem of predicting a real scalar random variable y at
preassigned values of many real scalar variables $x_1, x_2, \ldots, x_k$. As examples,
we can cite many situations.




Example 14.5.
y = the marks obtained by a student in an examination
$x_1$ = the amount of time spent on it
$x_2$ = instructor's ability measured on the scale $0 \le x_2 \le 10$
$x_3$ = instructor's knowledge in the subject matter
$x_4$ = student's own background preparation in the area


Example 14.6.
y = cost of living
$x_1$ = unit price for staple food
$x_2$ = unit price for vegetables
$x_3$ = cost of transportation

and so on. There can be many factors contributing to y in Example 14.5 as well as in
Example 14.6. We are not sure in which form these variables $x_1, \ldots, x_k$ will
enter into the picture. If a function of $x_1, \ldots, x_k$ is used as the prediction
function for y, then, exactly as in the one-variable situation, the best prediction
function, best in the sense of minimum mean square error, is again the conditional
expectation of y given $x_1, \ldots, x_k$, that is, $E(y|x_1, \ldots, x_k)$. Hence the
regression of y on $x_1, \ldots, x_k$ is again defined as the conditional expectation.


Definition 14.3. The regression of y on $x_1, \ldots, x_k$ is defined as the conditional
expectation of y given $x_1, \ldots, x_k$, that is, $E(y|x_1, \ldots, x_k)$.


As before, depending upon the conditional distribution of y given $x_1, \ldots, x_k$,
the regression function may or may not be a linear function of $x_1, \ldots, x_k$.


Example 14.7. If $y, x_1, x_2$ have a joint density

\[
f(y, x_1, x_2) = \begin{cases} \frac{2}{3}(y + x_1 + x_2), & 0 \le y, x_1, x_2 \le 1 \\ 0, & \text{elsewhere} \end{cases}
\]

evaluate the regression of y on $x_1$ and $x_2$.


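Solution 14.7. The joint marginal density of $x_1$ and $x_2$ is given by integrating
out y. Denoting it by $f_1(x_1, x_2)$, we have

\[
f_1(x_1, x_2) = \int_{y=0}^{1} \frac{2}{3}(y + x_1 + x_2)\,dy
= \begin{cases} \frac{2}{3}\left(\frac{1}{2} + x_1 + x_2\right), & 0 \le x_1 \le 1,\ 0 \le x_2 \le 1 \\ 0, & \text{elsewhere.} \end{cases}
\]

Hence the conditional density of y given $x_1$ and $x_2$ is

\[
g(y|x_1, x_2) = \begin{cases} \dfrac{y + x_1 + x_2}{\frac{1}{2} + x_1 + x_2}, & 0 \le y \le 1 \\ 0, & \text{elsewhere.} \end{cases}
\]

Therefore, the regression of y on $x_1, x_2$ is given by

\[
E(y|x_1, x_2) = \int_{y=0}^{1} y \left[ \frac{y + x_1 + x_2}{\frac{1}{2} + x_1 + x_2} \right] dy
= \frac{\frac{1}{3} + \frac{x_1 + x_2}{2}}{\frac{1}{2} + x_1 + x_2}.
\]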


This is the best predictor of y. For example, the predicted value of y at $x_1 = 0$,
$x_2 = \frac{1}{2}$ is given by

\[
\frac{\frac{1}{3} + \frac{1}{4}}{\frac{1}{2} + 0 + \frac{1}{2}} = \frac{7}{12}.
\]


In this example, how can we predict y based on $x_1$ alone, that is, $E(y|x_1)$? First,
integrate out $x_2$ and obtain the joint marginal density of y and $x_1$. Then proceed
as in the one-variable case. Note that $E(y|x_1)$ is not the same as
$E(y|x_1, x_2 = 0)$. These are two different statements and two different items.
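The distinction between $E(y|x_1)$ and $E(y|x_1, x_2 = 0)$ can be checked numerically. The following is only a sketch under stated assumptions: it takes the joint density of Example 14.7 to be $\frac{2}{3}(y + x_1 + x_2)$ on the unit cube, and the function names and the midpoint-rule integrator are illustrative choices, not part of the text.

```python
def midpoint_integral(func, lower, upper, steps=2000):
    """Approximate the integral of func over [lower, upper] by the midpoint rule."""
    width = (upper - lower) / steps
    return width * sum(func(lower + (i + 0.5) * width) for i in range(steps))


def joint_density(y, x1, x2):
    """Assumed joint density of Example 14.7: (2/3)(y + x1 + x2) on the unit cube."""
    inside = 0 <= y <= 1 and 0 <= x1 <= 1 and 0 <= x2 <= 1
    return (2.0 / 3.0) * (y + x1 + x2) if inside else 0.0


def regression_on_x1_x2(x1, x2):
    """E(y | x1, x2): ratio of the integrals over y of y*f and of f."""
    numerator = midpoint_integral(lambda y: y * joint_density(y, x1, x2), 0.0, 1.0)
    denominator = midpoint_integral(lambda y: joint_density(y, x1, x2), 0.0, 1.0)
    return numerator / denominator


def regression_on_x1(x1):
    """E(y | x1): integrate x2 out of the joint density first, then condition on x1."""
    def marginal_y_x1(y):
        return midpoint_integral(lambda x2: joint_density(y, x1, x2), 0.0, 1.0, steps=200)

    numerator = midpoint_integral(lambda y: y * marginal_y_x1(y), 0.0, 1.0, steps=200)
    denominator = midpoint_integral(marginal_y_x1, 0.0, 1.0, steps=200)
    return numerator / denominator
```

With these definitions, `regression_on_x1(0.0)` is close to 7/12 while `regression_on_x1_x2(0.0, 0.0)` is close to 2/3, so conditioning on $x_1$ alone differs from conditioning on $x_1$ together with $x_2 = 0$.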

Again, note that for constructing the regression function of y on $x_1, \ldots, x_k$,
that is, $E(y|x_1, \ldots, x_k)$, we need only the conditional distribution of y given
$x_1, \ldots, x_k$ and we do not require the joint distribution of $y, x_1, \ldots, x_k$.


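Example 14.8. If the conditional density of y given $x_1, \ldots, x_k$ is given by

\[
g(y|x_1, \ldots, x_k) = \begin{cases} \dfrac{1}{5 + x_1 + \cdots + x_k}\, e^{-\frac{y}{5 + x_1 + \cdots + x_k}}, & 0 \le y < \infty,\ 5 + x_1 + \cdots + x_k > 0 \\ 0, & \text{elsewhere} \end{cases}
\]

evaluate the regression function.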


Solution 14.8. The conditional density is the exponential density with the mean value

\[ E(y|x_1, \ldots, x_k) = 5 + x_1 + \cdots + x_k \]

and this is the regression of y on $x_1, \ldots, x_k$, which here is also a linear
function of $x_1, \ldots, x_k$.

When the variables are all jointly normally distributed, the regression function is
also linear in the regressed variables. Example 14.8 illustrates that joint normality
is not needed for the regression function to be linear. When the regression function
is linear, we have some interesting properties.




Exercises 14.2-14.3 


14.3.1. Prove that $\min_a E|y - a| \Rightarrow a =$ median of y, where a is an
arbitrary constant. Prove the result for the continuous, discrete and mixed cases for y.


14.3.2. Let 


c 
X, Ξ------» 1 <y< y (0) κχς 1, 
foc y) Game y «oo 
and zero elsewhere. If f (x, y) is a joint density function of x and y, then evaluate (i) the 
normalizing constant c; (ii) the marginal density of y; (iii) the marginal density of x; 
(iv) the conditional density of y given x. 


14.3.3. By using the joint density in Exercise 14.3.2, evaluate the regression of y on x 


and then predict y at x = 5. 


14.3.4. Consider the function 
fo y) soy, O<y<1,1<x<2, 


and zero elsewhere. If this is a joint density function, then evaluate (i) the normalizing 
constant c; (ii) the marginal density of x; (iii) the conditional density of y given x; (iv) 


the regression of y given x; (v) the predicted value of y at x - 5 


14.3.5. Let 
fO x»Xx3)2 CCL *Xx(*X;*x3, O< x, <1, 1=1,2,3 


and zero elsewhere be a density function. Then evaluate the following: (i) the nor- 
malizing constant c; (ii) the regression of x, on x;, x3; (iii) the predicted value of x, at 
Xp = ΣΧ = i (iv) the predicted value of x, at x; = 


=l 1 
πο 33 


14.4 Linear regression 


Let $y, x_1, \ldots, x_k$ be real scalar random variables and let the regression of y on
$x_1, \ldots, x_k$ be a linear function of $x_1, \ldots, x_k$, that is,

\[ E(y|x_1, \ldots, x_k) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k \tag{14.3} \]

where $\beta_0, \beta_1, \ldots, \beta_k$ are constants. If joint moments up to the second
order exist, then we can evaluate the constants $\beta_0, \beta_1, \ldots, \beta_k$ in
terms of product moments. In order to achieve this, we will need two results from
elementary statistics. These were given earlier, but for ready reference they are
listed here again as lemmas.




Lemma 14.1.

\[ E(u) = E_v[E(u|v)] \]

whenever the expected values exist. Here, E(u|v) is computed in the conditional space
of u given v, that is, from the conditional distribution of u given v, and is a function
of v. The resulting quantity is then treated as a function of the random variable v in
the next step of taking the expected value $E_v(\cdot)$.


Lemma 14.2.

\[ \mathrm{Var}(u) = \mathrm{Var}[E(u|v)] + E[\mathrm{Var}(u|v)] \]


That is, the sum of the variance of the conditional expectation and the expected value 
of the conditional variance is the unconditional variance of any random variable u, 
as long as the variances exist. 


All the expected values and variances defined there must exist for the results to 
hold. The proofs follow from the definitions themselves and are left to the students. 
Let us look into the implications of these two lemmas with the help of some exam- 
ples. 
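Before turning to the examples, the two lemmas can be checked exactly on a small discrete distribution. The sketch below is illustrative only: the joint probability function `joint_pmf` of (u, v) is a hypothetical choice made for this check and does not come from the text, and all helper names are assumptions.

```python
from fractions import Fraction

# Hypothetical joint probability function of (u, v); values chosen only for illustration.
joint_pmf = {
    (0, 0): Fraction(1, 8), (1, 0): Fraction(3, 8),
    (0, 1): Fraction(1, 4), (2, 1): Fraction(1, 4),
}


def marginal_v():
    """Marginal probability function of v."""
    probs = {}
    for (_, v), p in joint_pmf.items():
        probs[v] = probs.get(v, Fraction(0)) + p
    return probs


def conditional_moment(v_value, power):
    """E(u**power | v = v_value), computed from the conditional distribution of u given v."""
    p_v = marginal_v()[v_value]
    return sum(p * u ** power for (u, v), p in joint_pmf.items() if v == v_value) / p_v


# Unconditional mean and variance of u.
mean_u = sum(p * u for (u, _), p in joint_pmf.items())
var_u = sum(p * u ** 2 for (u, _), p in joint_pmf.items()) - mean_u ** 2

# Conditional mean and conditional variance of u as functions of v.
cond_mean = {v: conditional_moment(v, 1) for v in marginal_v()}
cond_var = {v: conditional_moment(v, 2) - cond_mean[v] ** 2 for v in marginal_v()}

# Lemma 14.1: E_v[E(u|v)].
mean_of_cond_mean = sum(p * cond_mean[v] for v, p in marginal_v().items())
# Lemma 14.2: Var[E(u|v)] and E[Var(u|v)].
var_of_cond_mean = (
    sum(p * cond_mean[v] ** 2 for v, p in marginal_v().items()) - mean_of_cond_mean ** 2
)
mean_of_cond_var = sum(p * cond_var[v] for v, p in marginal_v().items())
```

Because the arithmetic is done with exact fractions, `mean_of_cond_mean == mean_u` and `var_of_cond_mean + mean_of_cond_var == var_u` hold exactly rather than approximately.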


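Example 14.9. Consider the joint density of x and y, given by

\[
f(x, y) = \begin{cases} \dfrac{1}{x^2} \dfrac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(y - 2 - x)^2}, & -\infty < y < \infty,\ 1 \le x < \infty \\ 0, & \text{elsewhere.} \end{cases}
\]

Evaluate the regression of y on x, and also verify Lemma 14.1.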


Solution 14.9. Integrating out y, one has the marginal density of x. Integration with
respect to y can be effected by looking at a normal density in y with expected value
2 + x. Then the marginal density of x is given by

\[
f_1(x) = \begin{cases} \dfrac{1}{x^2}, & 1 \le x < \infty \\ 0, & \text{elsewhere.} \end{cases}
\]

Because the joint density is the product of the conditional and marginal densities,
the conditional density of y given x is normal with expected value 2 + x, and hence
the regression of y on x is given by

\[ E(y|x) = 2 + x \tag{14.4} \]

which is a linear, well-behaved, smooth function of x. The expected value of the right
side of (14.4) is then

\[
E(2 + x) = 2 + E(x) = 2 + \int_1^{\infty} \frac{x}{x^2}\,dx = 2 + [\ln x]_1^{\infty} = \infty.
\]

Thus the expected value does not exist and Lemma 14.1 is not applicable here.


Note that in Lemma 14.1 the variable v could be a single real scalar variable or a 
collection of real scalar variables. But since Lemma 14.2 is specific about variance of a 
single variable, the formula does not work if v contains many variables. 


Example 14.10. Verify Lemma 14.1 for the following joint density:

\[
f(x, y) = \begin{cases} 2, & 0 \le x \le y \le 1 \\ 0, & \text{elsewhere.} \end{cases}
\]


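Solution 14.10. Here, the surface z = f(x, y) is a prism sitting on the (x, y)-plane,
and the non-zero part of the density is in the triangle $0 \le x \le y \le 1$. Thus the
region can be defined as either $0 \le x \le y$ and $0 \le y \le 1$, or $x \le y \le 1$
and $0 \le x \le 1$. Marginally, $0 \le x \le 1$ as well as $0 \le y \le 1$. The marginal
densities of x and y are respectively

\[
f_1(x) = \int_{y=x}^{1} 2\,dy = \begin{cases} 2(1 - x), & 0 \le x \le 1 \\ 0, & \text{elsewhere;} \end{cases}
\]

\[
f_2(y) = \int_{x=0}^{y} 2\,dx = \begin{cases} 2y, & 0 \le y \le 1 \\ 0, & \text{elsewhere.} \end{cases}
\]

Hence

\[
E(y) = \int_0^1 y(2y)\,dy = \frac{2}{3} \quad \text{and} \quad E(x) = \int_0^1 x[2(1 - x)]\,dx = \frac{1}{3}.
\]

The conditional density of y given x is given by

\[
g(y|x) = \frac{f(x, y)}{f_1(x)} = \frac{2}{2(1 - x)} = \frac{1}{1 - x}, \quad x \le y \le 1
\]

and zero elsewhere. Note that when x is fixed at some point, y can only vary from
that point to 1. Therefore, the conditional expectation of y, given x, is

\[
E(y|x) = \int_{y=x}^{1} \frac{y}{1 - x}\,dy = \frac{1 - x^2}{2(1 - x)} = \frac{1 + x}{2}.
\]

From this, by taking the expected value, we have

\[
E_x[E(y|x)] = E\left[\frac{1 + x}{2}\right] = \frac{1}{2} + \frac{1}{2}E(x) = \frac{1}{2} + \frac{1}{6} = \frac{2}{3} = E(y).
\]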




Thus the result is verified. [Note that when we take $E_x(\cdot)$ we replace the
preassigned x by the random variable x; that is, we consider all values taken by x
with the corresponding density. We switch back to the marginal density of x and are
no longer in the conditional density.]
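The verification in Example 14.10 can also be reproduced with exact polynomial integration. This is a sketch under stated assumptions: it takes the marginal densities $2(1 - x)$ and $2y$ on [0, 1] and the regression $(1 + x)/2$ as derived for the density f(x, y) = 2 on $0 \le x \le y \le 1$; polynomials are stored as coefficient lists (constant term first), and the helper names are illustrative.

```python
from fractions import Fraction


def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (constant term first)."""
    result = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            result[i + j] += Fraction(a) * Fraction(b)
    return result


def integrate_0_1(p):
    """Exact integral over [0, 1] of the polynomial with coefficient list p."""
    return sum(Fraction(c) / (k + 1) for k, c in enumerate(p))


density_x = [2, -2]                             # f_1(x) = 2(1 - x), 0 <= x <= 1
density_y = [0, 2]                              # f_2(y) = 2y,       0 <= y <= 1
regression = [Fraction(1, 2), Fraction(1, 2)]   # E(y|x) = (1 + x)/2
identity = [0, 1]                               # the polynomial t itself

mean_y = integrate_0_1(poly_mul(identity, density_y))            # E(y)
mean_x = integrate_0_1(poly_mul(identity, density_x))            # E(x)
mean_of_regression = integrate_0_1(poly_mul(regression, density_x))  # E_x[E(y|x)]
```

Here `mean_of_regression` and `mean_y` agree, which is the statement of Lemma 14.1 for this density.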


Coming back to the linear regression in (14.3), we have

\[ E(y|x_1, \ldots, x_k) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k. \tag{14.5} \]

Taking the expected value on both sides, which means the expected value in the joint
marginal density of $x_1, \ldots, x_k$, and using Lemma 14.1,

\[
\begin{aligned}
E[E(y|x_1, \ldots, x_k)] &= E(y) \\
E[\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k] &= \beta_0 + \beta_1 E(x_1) + \cdots + \beta_k E(x_k).
\end{aligned} \tag{14.6}
\]

Note that taking the expected value of $x_j$ in the joint distribution of
$x_1, \ldots, x_k$ is equivalent to taking the expected value of $x_j$ in the marginal
distribution of $x_j$ alone, because the other variables can be integrated out (or
summed up, if discrete) first to obtain the marginal density of $x_j$ alone. [The
student may work out an example to grasp this point.] From (14.5) and (14.6), one has

\[ E(y|x_1, \ldots, x_k) - E(y) = \beta_1[x_1 - E(x_1)] + \cdots + \beta_k[x_k - E(x_k)]. \tag{14.7} \]


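Multiply both sides of (14.7) by $x_j - E(x_j)$ for a specific j and then take the
expected value with respect to $x_1, \ldots, x_k$. The right side gives the following:

\[
\beta_1 \mathrm{Cov}(x_1, x_j) + \beta_2 \mathrm{Cov}(x_2, x_j) + \cdots + \beta_j \mathrm{Var}(x_j) + \cdots + \beta_k \mathrm{Cov}(x_k, x_j) \tag{14.8}
\]

because

\[
E\{[x_j - E(x_j)][x_r - E(x_r)]\} = \begin{cases} \mathrm{Cov}(x_j, x_r), & \text{if } j \ne r \\ \mathrm{Var}(x_j), & \text{if } j = r. \end{cases}
\]

The left side of (14.7) leads to the following:

\[ E\{[x_j - E(x_j)]E(y)\} = E(y)\{E[x_j - E(x_j)]\} = 0 \]

since for any variable $x_j$, $E[x_j - E(x_j)] = 0$ as long as the expected value exists.

\[
\begin{aligned}
E\{[x_j - E(x_j)]E(y|x_1, \ldots, x_k)\} &= E\{E[y(x_j - E(x_j))|x_1, \ldots, x_k]\} \\
&= E[y(x_j - E(x_j))]
\end{aligned}
\]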




since in the conditional expectation $x_1, \ldots, x_k$ are fixed and, therefore, one can
take $x_j - E(x_j)$, being constant, inside the conditional expected value and write
$y(x_j - E(x_j))$ given $x_1, \ldots, x_k$. But

\[ E[y(x_j - E(x_j))] = \mathrm{Cov}(y, x_j) \]

because for any two real scalar random variables x and y,

\[
\mathrm{Cov}(x, y) = E\{(x - E(x))(y - E(y))\} = E\{x[y - E(y)]\} = E\{y[x - E(x)]\}
\]

because

\[ E\{E(x)[y - E(y)]\} = E(x)E[y - E(y)] = 0 \]

since $E[y - E(y)] = E(y) - E(y) = 0$ and similarly $E\{E(y)[x - E(x)]\} = 0$. Therefore,
we have

\[ \sigma_{jy} = \beta_1 \sigma_{1j} + \beta_2 \sigma_{2j} + \cdots + \beta_k \sigma_{kj} \tag{14.9} \]


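where $\sigma_{ij} = \mathrm{Cov}(x_i, x_j)$ and $\sigma_{jy} = \sigma_{yj} = \mathrm{Cov}(x_j, y)$.
Writing (14.9) explicitly, one has

\[
\begin{aligned}
\sigma_{1y} &= \beta_1 \sigma_{11} + \beta_2 \sigma_{21} + \cdots + \beta_k \sigma_{k1} \\
\sigma_{2y} &= \beta_1 \sigma_{12} + \beta_2 \sigma_{22} + \cdots + \beta_k \sigma_{k2} \\
&\ \ \vdots \\
\sigma_{ky} &= \beta_1 \sigma_{1k} + \beta_2 \sigma_{2k} + \cdots + \beta_k \sigma_{kk}.
\end{aligned}
\]

Writing in matrix notation, and using $\sigma_{ij} = \sigma_{ji}$, we have

\[
\Sigma_y = \Sigma\beta, \quad \Sigma = (\sigma_{ij}), \quad
\beta = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}, \quad
\Sigma_y = \begin{bmatrix} \sigma_{1y} \\ \vdots \\ \sigma_{ky} \end{bmatrix}, \quad
\Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1k} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2k} \\ \vdots & \vdots & & \vdots \\ \sigma_{k1} & \sigma_{k2} & \cdots & \sigma_{kk} \end{bmatrix}.
\]

Note that the covariance matrix $\Sigma$ is symmetric since $\sigma_{ij} = \sigma_{ji}$ for
all i and j. If $\Sigma$ is non-singular, then

\[ \beta = \Sigma^{-1}\Sigma_y \tag{14.10} \]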


where $\Sigma^{-1}$ denotes the regular inverse of the covariance matrix, or
variance-covariance matrix, $\Sigma$. This notation $\Sigma$ is another awkward symbol in
statistics, since it can easily be confused with the summation symbol. But since it is
very widely used, we will also use it here. Is $\Sigma$ likely to be non-singular? If
$\Sigma$ is singular, then at least one of the rows (columns) of $\Sigma$ is a linear
function of the other rows (columns). This can happen if at least one of the variables
$x_1, \ldots, x_k$ is a linear function of the other variables. Since these variables are
preassigned, nobody will preassign one point $(x_1, \ldots, x_k)$ and another point as a
constant multiple of it, $a(x_1, \ldots, x_k)$, because the second point does not give any
more information. Thus when the points are preassigned, as in a regression problem, one
can assume, without loss of generality, that $\Sigma$ is non-singular. But when $\Sigma$
is estimated, since observations are taken on the variables, near singularity may occur.
We will look into this aspect later. From (14.10) and (14.6), one has


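\[
\begin{aligned}
\beta_0 &= E(y) - \beta_1 E(x_1) - \cdots - \beta_k E(x_k) \\
&= E(y) - \beta' E(X)
\end{aligned} \tag{14.11}
\]

where a prime denotes the transpose and

\[
\beta = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}, \quad
E(X) = \begin{bmatrix} E(x_1) \\ \vdots \\ E(x_k) \end{bmatrix}.
\]

From (14.10), we have, for example, $\beta_1 = \Sigma^{(1)}\Sigma_y$ where $\Sigma^{(1)}$ is
the first row of $\Sigma^{-1}$, and in general $\beta_j = \Sigma^{(j)}\Sigma_y$, where
$\Sigma^{(j)}$ is the j-th row of $\Sigma^{-1}$, for $j = 1, \ldots, k$.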

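Instead of denoting the variables as y and $x_1, \ldots, x_k$, we may denote the variables
simply as $x_1, \ldots, x_k$, and the problem is to predict $x_1$ by preassigning
$x_2, \ldots, x_k$. In this notation, we can write the various quantities in terms of the
sub-matrices of $\Sigma$. For this purpose, let us write

\[
\Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1k} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2k} \\ \vdots & \vdots & & \vdots \\ \sigma_{k1} & \sigma_{k2} & \cdots & \sigma_{kk} \end{bmatrix}
= \begin{bmatrix} \sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix},
\]

\[
\Sigma_{12} = [\sigma_{12}, \ldots, \sigma_{1k}], \quad
\Sigma_{21} = \Sigma_{12}' = \begin{bmatrix} \sigma_{21} \\ \vdots \\ \sigma_{k1} \end{bmatrix}, \quad
\Sigma_{22} = \begin{bmatrix} \sigma_{22} & \cdots & \sigma_{2k} \\ \vdots & & \vdots \\ \sigma_{k2} & \cdots & \sigma_{kk} \end{bmatrix}. \tag{14.12}
\]

Then the best predictor, best in the minimum mean square sense, for predicting $x_1$ at
preassigned values of $x_2, \ldots, x_k$ is given by $E(x_1|x_2, \ldots, x_k)$, and if this
regression of $x_1$ on $x_2, \ldots, x_k$ is linear in $x_2, \ldots, x_k$, then it is of the
form

\[ E(x_1|x_2, \ldots, x_k) = \alpha_0 + \alpha_2 x_2 + \cdots + \alpha_k x_k \tag{14.13} \]

where $\alpha_0, \alpha_2, \ldots, \alpha_k$ are constants. Then from (14.10),

\[
\alpha = \begin{bmatrix} \alpha_2 \\ \vdots \\ \alpha_k \end{bmatrix} = \Sigma_{22}^{-1}\Sigma_{21} \tag{14.14}
\]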




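and the regression of $x_1$ on $x_2, \ldots, x_k$, or the best predictor of $x_1$ at
preassigned values of $x_2, \ldots, x_k$, when the regression is linear, is given by

\[
E(x_1|x_2, \ldots, x_k) = \alpha_0 + \alpha' X_2, \quad
\alpha = \begin{bmatrix} \alpha_2 \\ \vdots \\ \alpha_k \end{bmatrix}, \quad
X_2 = \begin{bmatrix} x_2 \\ \vdots \\ x_k \end{bmatrix}. \tag{14.15}
\]

For example, when k = 2,

\[
\begin{aligned}
E(x_1|x_2) &= E(x_1) + \frac{\mathrm{Cov}(x_1, x_2)}{\mathrm{Var}(x_2)}\,(x_2 - E(x_2)) \\
&= \mu_1 + \frac{\rho\sigma_1\sigma_2}{\sigma_2^2}\,(x_2 - \mu_2) \\
&= \mu_1 + \rho\,\frac{\sigma_1}{\sigma_2}\,(x_2 - \mu_2)
\end{aligned} \tag{14.16}
\]

where $E(x_1) = \mu_1$, $E(x_2) = \mu_2$, $\mathrm{Var}(x_1) = \sigma_1^2$,
$\mathrm{Var}(x_2) = \sigma_2^2$ and $\rho$ is the correlation between $x_1$ and $x_2$. This
is a very useful result when we consider a linear regression of one real scalar variable
on another real scalar variable.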

Let us compute the correlation between $x_1$ and its best linear predictor, that is,
between $x_1$ and the predicting function in (14.15).


Example 14.11. If $X' = (x_1, x_2, x_3)$ has the mean value

\[ E(X') = (E(x_1), E(x_2), E(x_3)) = (2, 1, -1) \]

and the covariance matrix

\[
\mathrm{Cov}(X) = \begin{bmatrix} 2 & -1 & 0 \\ -1 & 2 & 1 \\ 0 & 1 & 1 \end{bmatrix},
\]

construct the regression function for predicting $x_1$ at given values of $x_2$ and
$x_3$, if it is known that the regression is linear in $x_2, x_3$.


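Solution 14.11. As per our notation,

\[
\sigma_{11} = 2, \quad \Sigma_{12} = [-1, 0], \quad
\Sigma_{22} = \begin{bmatrix} 2 & 1 \\ 1 & 1 \end{bmatrix}, \quad
\Sigma_{22}^{-1} = \begin{bmatrix} 1 & -1 \\ -1 & 2 \end{bmatrix}
\]

and hence

\[
\Sigma_{12}\Sigma_{22}^{-1} = [-1, 0]\begin{bmatrix} 1 & -1 \\ -1 & 2 \end{bmatrix} = [-1, 1].
\]

Hence the best predictor is

\[
\begin{aligned}
E(x_1|x_2, x_3) &= E(x_1) + \Sigma_{12}\Sigma_{22}^{-1}(X_2 - E(X_2)) \\
&= 2 + [-1, 1]\begin{bmatrix} x_2 - 1 \\ x_3 + 1 \end{bmatrix} \\
&= 2 - x_2 + 1 + x_3 + 1 = 4 - x_2 + x_3.
\end{aligned}
\]

This is the best prediction function for predicting $x_1$ at preassigned values of $x_2$
and $x_3$.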
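The computation in Solution 14.11 can be reproduced in exact arithmetic. The sketch below assumes the mean vector (2, 1, -1) and the covariance matrix of Example 14.11, and uses the relation $\alpha' = \Sigma_{12}\Sigma_{22}^{-1}$ together with $\alpha_0 = E(x_1) - \alpha' E(X_2)$; the function and variable names are illustrative, not from the text.

```python
from fractions import Fraction

# Mean vector and covariance matrix assumed from Example 14.11.
mean = [Fraction(2), Fraction(1), Fraction(-1)]
cov = [[2, -1, 0],
       [-1, 2, 1],
       [0, 1, 1]]

# Partition: Sigma_12 is the first row without sigma_11; Sigma_22 is the lower-right block.
sigma_12 = [Fraction(v) for v in cov[0][1:]]
sigma_22 = [[Fraction(v) for v in row[1:]] for row in cov[1:]]


def inverse_2x2(m):
    """Inverse of a 2 x 2 matrix; raises ValueError when the matrix is singular."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if det == 0:
        raise ValueError("singular matrix")
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


sigma_22_inv = inverse_2x2(sigma_22)

# alpha' = Sigma_12 Sigma_22^{-1}: coefficients of x_2 and x_3.
alpha = [sum(sigma_12[i] * sigma_22_inv[i][j] for i in range(2)) for j in range(2)]
# alpha_0 = E(x_1) - alpha' E(X_2).
alpha_0 = mean[0] - alpha[0] * mean[1] - alpha[1] * mean[2]


def predict_x1(x2, x3):
    """Best linear predictor of x_1 at preassigned x_2 and x_3."""
    return alpha_0 + alpha[0] * x2 + alpha[1] * x3
```

The coefficients come out as `alpha_0 == 4` and `alpha == [-1, 1]`, that is, the predictor $4 - x_2 + x_3$; evaluating it at the mean values $x_2 = 1$, $x_3 = -1$ returns $E(x_1) = 2$, as it should for any linear regression.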


14.4.1 Correlation between $x_1$ and its best linear predictor


In order to compute the correlation between $x_1$ and $E(x_1|x_2, \ldots, x_k)$, we need
the variances of these two quantities and the covariance between them. As per our
notation in (14.12), we have

\[ \mathrm{Var}(x_1) = \sigma_{11}. \tag{14.17} \]

The variance of $\alpha_0 + \alpha' X_2$ can be computed by using the result on the
variance of a linear function of scalar variables. This result is stated as a lemma;
it follows directly from the definitions.


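Lemma 14.3. Consider linear functions of real scalar random variables $y_1, \ldots, y_n$,
with covariances $\mathrm{Cov}(y_i, y_j) = v_{ij}$, $i, j = 1, \ldots, n$, so that
$v_{jj} = \mathrm{Var}(y_j)$, and let the variance-covariance matrix of
$(y_1, \ldots, y_n)$ be denoted by $V = (v_{ij})$. Let

\[ u = a_0 + a_1 y_1 + a_2 y_2 + \cdots + a_n y_n = a_0 + a'Y \]

\[ v = b_0 + b_1 y_1 + b_2 y_2 + \cdots + b_n y_n = b_0 + b'Y \]

where

\[
a = \begin{bmatrix} a_1 \\ \vdots \\ a_n \end{bmatrix}, \quad
b = \begin{bmatrix} b_1 \\ \vdots \\ b_n \end{bmatrix}, \quad
Y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}
\]

and a prime denotes the transpose. Then

\[ \mathrm{Var}(u) = a'Va, \quad \mathrm{Var}(v) = b'Vb, \quad \mathrm{Cov}(u, v) = a'Vb = b'Va. \]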


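Note that V is a symmetric matrix. Then with the help of Lemma 14.3, we have

\[
\begin{aligned}
\mathrm{Var}[E(x_1|x_2, \ldots, x_k)] &= \mathrm{Var}[\alpha_0 + \alpha' X_2] = \mathrm{Var}[\alpha' X_2] \\
&= \alpha'\,\mathrm{Cov}(X_2)\,\alpha = \alpha'\Sigma_{22}\alpha
\end{aligned} \tag{14.18}
\]

where $\mathrm{Cov}(X_2)$ means the covariance matrix of $X_2$. But from (14.14) and
(14.18), the variance of the best linear predictor is

\[
\alpha'\Sigma_{22}\alpha = [\Sigma_{22}^{-1}\Sigma_{21}]'\,\Sigma_{22}\,[\Sigma_{22}^{-1}\Sigma_{21}] = \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \tag{14.19}
\]

because $\Sigma_{22}' = \Sigma_{22}$ and $\Sigma_{21}' = \Sigma_{12}$. The covariance
between $x_1$ and its best linear predictor is then

\[
\begin{aligned}
\mathrm{Cov}[x_1, E(x_1|x_2, \ldots, x_k)] &= \mathrm{Cov}[x_1, \alpha_0 + \alpha' X_2] \\
&= \mathrm{Cov}[x_1, \alpha' X_2] = \alpha'\,\mathrm{Cov}(X_2, x_1) = \alpha'\Sigma_{21} \\
&= [\Sigma_{22}^{-1}\Sigma_{21}]'\Sigma_{21} = \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.
\end{aligned}
\]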


Strangely enough, the covariance between $x_1$ and its best linear predictor is the
same as the variance of the best linear predictor. Variance being non-negative, it is
clear that the covariance in this case is also non-negative, and hence the correlation
is also non-negative. Denoting the correlation by $\rho_{1.(2 \ldots k)}$, we have

\[
\rho^2_{1.(2 \ldots k)} = \frac{(\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21})^2}{\sigma_{11}\,(\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21})} = \frac{\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}}{\sigma_{11}}. \tag{14.20}
\]

But note that the best predictor $E(x_1|x_2, \ldots, x_k)$ for predicting $x_1$ at
preassigned values of $x_2, \ldots, x_k$ need not be linear. We have cited several
examples where the regression function is non-linear. But if the regression is linear,
then the correlation between $x_1$ and its best linear predictor has the nice form
given in (14.20).


Exercises 14.4 


14.4.1. Write the following linear functions by using vector, matrix notations. For ex- 
ample, 4 + x, - x; + 5x4 = 4 + a! X = b'Y where the prime denotes the transpose and 


1 

1 χι 1 5 
a-|-1|, Χ-|χ;|, b= Eu sy 
5 x = n 

2 5 χ 


(i) y=24+x,+%-x3 (d) y21«2x«-x,5 (i) y=5+xX +X- 2X +X4 


14.4.2. Write down the following quadratic forms in the form Χ’ ΑΧ where A =A’: 
G) x+ 3x5 
Gii) 2x? + x3 — x9 + 2x4x; — X3X3; 


Gii) x? 4324 aR. 


14.4.3. Write the same quantities in Exercise 14.4.2 as X' AX where A z A'. 


14.4.4. Can the following matrices represent covariance matrices, if so prove and if 
not explain why? 




1 -1 d 310 2 1 
A,=|-1 2 Οἱ, As=|1 2 2|, Αξίι -3 1 
1 0 4 ο 2 2 0 1 


14.4.5. Let $X' = (x_1, x_2, x_3)$ have a joint distribution with the following mean
value and covariance matrix:

\[
E(X) = \begin{bmatrix} 1 \\ -1 \\ 2 \end{bmatrix}, \quad
\mathrm{Cov}(X) = V = \begin{bmatrix} 1 & 1 & -1 \\ 1 & 3 & 0 \\ -1 & 0 & 2 \end{bmatrix}.
\]

Let the regression of $x_1$ on $x_2$ and $x_3$ be a linear function of $x_2$ and $x_3$.
(i) Construct the best linear predictor $E(x_1|x_2, x_3)$ for predicting $x_1$ at
preassigned values of $x_2$ and $x_3$; (ii) predict $x_1$ at $x_2 = 1$, $x_3 = 0$;
(iii) compute the variance of the best predictor $E(x_1|x_2, x_3)$; (iv) compute the
covariance between $x_1$ and the best linear predictor; (v) compute the correlation
between $x_1$ and its best linear predictor.


14.5 Multiple correlation coefficient $\rho_{1.(2 \ldots k)}$


The multiple correlation coefficient is simply defined as

\[
\rho_{1.(2 \ldots k)} = \sqrt{\frac{\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}}{\sigma_{11}}} \tag{14.21}
\]

with the notations as given in (14.12). It does not mean that we are assuming that there
is a linear regression. If the regression is linear, then the multiple correlation
coefficient is also the correlation between $x_1$ and its best linear predictor. The
expression in (14.20) itself has many interesting properties, and the multiple
correlation coefficient, as defined in (14.21), has many statistical properties.


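Example 14.12. For the same covariance matrix as in Example 14.11, compute
$\rho^2_{1.(23)}$.

Solution 14.12. We need to compute $\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}/\sigma_{11}$, out
of which $\Sigma_{12}\Sigma_{22}^{-1}$ is already computed in Example 14.11 as
$\Sigma_{12}\Sigma_{22}^{-1} = [-1, 1]$, and $\Sigma_{21} = \begin{bmatrix} -1 \\ 0 \end{bmatrix}$.
Therefore,

\[
\rho^2_{1.(23)} = \frac{\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}}{\sigma_{11}} = \frac{1}{2}\,[-1, 1]\begin{bmatrix} -1 \\ 0 \end{bmatrix} = \frac{1}{2}.
\]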


14.5.1 Some properties of the multiple correlation coefficient 


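Let $b_2 x_2 + \cdots + b_k x_k = b'X_2$, where $b' = (b_2, \ldots, b_k)$ and
$X_2' = (x_2, \ldots, x_k)$, be an arbitrary linear predictor of $x_1$. That is, $x_1$ is
predicted by using $b'X_2$. Let us compute the correlation between $x_1$ and this
arbitrary predictor $b'X_2$. Note that, from Lemma 14.3, we have

\[ \mathrm{Var}(b'X_2) = b'\Sigma_{22}b \quad \text{and} \quad \mathrm{Cov}(x_1, b'X_2) = b'\Sigma_{21}. \]

Then the square of the correlation between $x_1$ and an arbitrary linear predictor,
denoted by $\eta^2$, is given by the following:

\[ \eta^2 = \frac{(b'\Sigma_{21})^2}{(b'\Sigma_{22}b)\,\sigma_{11}}. \]

But from the Cauchy-Schwarz inequality we have

\[
(b'\Sigma_{21})^2 = \left[(b'\Sigma_{22}^{1/2})(\Sigma_{22}^{-1/2}\Sigma_{21})\right]^2 \le [b'\Sigma_{22}b]\,[\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}]
\]

where $\Sigma_{22}^{1/2}$ is the symmetric positive definite square root of $\Sigma_{22}$.
Hence

\[
\eta^2 = \frac{(b'\Sigma_{21})^2}{(b'\Sigma_{22}b)\,\sigma_{11}} \le \frac{[b'\Sigma_{22}b]\,[\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}]}{[b'\Sigma_{22}b]\,\sigma_{11}} = \frac{\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}}{\sigma_{11}}. \tag{14.22}
\]

In other words, the maximum value of $\eta^2$ is $\rho^2_{1.(2 \ldots k)}$, the square of
the multiple correlation coefficient, given in (14.20).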


Result 14.3. The multiple correlation coefficient of $x_1$ on $(x_2, \ldots, x_k)$, where
$x_1, x_2, \ldots, x_k$ are all real scalar random variables, is also the maximum
correlation between $x_1$ and an arbitrary linear predictor of $x_1$ based on
$x_2, \ldots, x_k$.
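The maximality property can be illustrated on the covariance matrix of Example 14.11. The following sketch assumes $\sigma_{11} = 2$, $\Sigma_{21} = (-1, 0)'$ and $\Sigma_{22}$ with rows (2, 1) and (1, 1); the list of candidate coefficient vectors b is an arbitrary illustrative choice, and checking a handful of vectors is of course not a proof.

```python
from fractions import Fraction

# Blocks of the covariance matrix assumed from Example 14.11.
sigma_11 = Fraction(2)
sigma_21 = [Fraction(-1), Fraction(0)]
sigma_22 = [[Fraction(2), Fraction(1)],
            [Fraction(1), Fraction(1)]]


def squared_correlation(b):
    """eta^2 = (b' Sigma_21)^2 / ((b' Sigma_22 b) sigma_11) for the predictor b' X_2."""
    cov = sum(bi * si for bi, si in zip(b, sigma_21))
    var = sum(b[i] * sigma_22[i][j] * b[j] for i in range(2) for j in range(2))
    return cov ** 2 / (var * sigma_11)


# alpha = Sigma_22^{-1} Sigma_21, written out for the 2 x 2 case.
det = sigma_22[0][0] * sigma_22[1][1] - sigma_22[0][1] * sigma_22[1][0]
alpha = [
    (sigma_22[1][1] * sigma_21[0] - sigma_22[0][1] * sigma_21[1]) / det,
    (-sigma_22[1][0] * sigma_21[0] + sigma_22[0][0] * sigma_21[1]) / det,
]

# rho^2 = Sigma_12 Sigma_22^{-1} Sigma_21 / sigma_11, as in (14.20).
rho_squared = sum(s * a for s, a in zip(sigma_21, alpha)) / sigma_11

# Arbitrary non-zero coefficient vectors b = (b_2, b_3), chosen only for illustration.
candidates = [(-1, 1), (1, 0), (0, 1), (1, 1), (-2, 3), (3, -1)]
eta_squared = {b: squared_correlation(b) for b in candidates}
```

Every value in `eta_squared` is at most `rho_squared`, and the bound is attained at b = (-1, 1), which is the coefficient vector $\alpha$ of the best linear predictor.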


Note 14.3 (Cauchy-Schwarz inequality). Let $a_1, \ldots, a_n$ and $b_1, \ldots, b_n$ be two
sequences of real numbers. Then

\[
\left| \sum_{i=1}^{n} a_i b_i \right| \le \left[ \sum_{i=1}^{n} a_i^2 \right]^{\frac{1}{2}} \left[ \sum_{i=1}^{n} b_i^2 \right]^{\frac{1}{2}} \tag{N14.1}
\]

and the equality holds when $(a_1, \ldots, a_n)$ and $(b_1, \ldots, b_n)$ are linearly
related. In terms of real scalar random variables x and y, this inequality is the
following:

\[
|\mathrm{Cov}(x, y)| \le [\mathrm{Var}(x)]^{\frac{1}{2}}\,[\mathrm{Var}(y)]^{\frac{1}{2}} \tag{N14.2}
\]

and the equality holds when x and y are linearly related.

The proof is quite simple. (N14.1) is nothing but the statement $|\cos\theta| \le 1$, where
$\theta$ is the angle between the vectors $\vec{a} = (a_1, \ldots, a_n)$ and
$\vec{b} = (b_1, \ldots, b_n)$, and (N14.2) is the statement that $|\rho| \le 1$, where
$\rho$ is the correlation coefficient between the real scalar variables x and y. Thus
the covariance as well as the correlation between the real scalar random variables x
and y can be described as measuring $\cos\theta$, where $\theta$ measures the angular
dispersion between x and y or the scatter in the point (x, y). There are various
variations and extensions of the Cauchy-Schwarz inequality. But what we need to use in
our discussions is available from (N14.1) and (N14.2).




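Note 14.4 (Determinants and inverses of partitioned matrices). Consider a matrix A and
its regular inverse $A^{-1}$, when A is non-singular, and let A be partitioned as
follows:

\[
A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}, \quad
A^{-1} = \begin{bmatrix} A^{11} & A^{12} \\ A^{21} & A^{22} \end{bmatrix} \tag{N14.3}
\]

where $A_{11}, A_{12}, A_{21}, A_{22}$ are submatrices in A and
$A^{11}, A^{12}, A^{21}, A^{22}$ are submatrices in $A^{-1}$. For example, let

\[
A = \begin{bmatrix} 2 & 0 & 1 \\ 0 & 1 & 1 \\ 2 & 1 & -1 \end{bmatrix} = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}
\]

where

\[
A_{11} = [2], \quad A_{12} = [0, 1], \quad A_{21} = \begin{bmatrix} 0 \\ 2 \end{bmatrix}, \quad A_{22} = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}.
\]

Then from the elementary theory of matrices and determinants we have the following,
denoting the determinant of A by |A|:

\[
\begin{aligned}
|A| &= |A_{11}|\,|A_{22} - A_{21}A_{11}^{-1}A_{12}| \quad \text{if } |A_{11}| \ne 0 && \text{(N14.4)} \\
&= |A_{22}|\,|A_{11} - A_{12}A_{22}^{-1}A_{21}| \quad \text{if } |A_{22}| \ne 0. && \text{(N14.5)}
\end{aligned}
\]

For our illustrative example, the determinant of $A_{11}$ is $|A_{11}| = 2$ and

\[
|A_{22} - A_{21}A_{11}^{-1}A_{12}| = \left| \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} - \begin{bmatrix} 0 \\ 2 \end{bmatrix} \frac{1}{2}\,[0, 1] \right|
= \left| \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} - \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix} \right|
= \begin{vmatrix} 1 & 1 \\ 1 & -2 \end{vmatrix} = -3
\]

and, therefore,

\[ |A_{11}|\,|A_{22} - A_{21}A_{11}^{-1}A_{12}| = (2)(-3) = -6. \]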


The student may evaluate the determinant of A directly and verify the result, and may
also use (N14.5) and verify that result. The proofs of (N14.4) and (N14.5) are quite
simple. From the axioms defining a determinant, it follows that if linear functions of
one or more rows are added to one or more other rows, the value of the determinant
remains the same. This operation can be done one step at a time or several steps
together. Consider (N14.3). What is a suitable linear function of the rows containing
$A_{11}$ to be added to the remaining rows containing $A_{21}$ so that a null matrix
appears at the position of $A_{21}$? The appropriate linear combination is obtained by
a pre-multiplication by $-A_{21}A_{11}^{-1}$:

\[ -A_{21}A_{11}^{-1}\,[A_{11} \ \ A_{12}] = [-A_{21} \ \ {-A_{21}A_{11}^{-1}A_{12}}]. \]

Hence the resulting matrix and the corresponding determinant are the following:

\[
|A| = \begin{vmatrix} A_{11} & A_{12} \\ O & A_{22} - A_{21}A_{11}^{-1}A_{12} \end{vmatrix}.
\]

This is a triangular block matrix, and hence the determinant is the product of the
determinants of the diagonal blocks. That is,

\[ |A| = |A_{11}|\,|A_{22} - A_{21}A_{11}^{-1}A_{12}|. \]


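If $A^{-1}$ exists, then $AA^{-1} = I = A^{-1}A$. In the partitioned format in (N14.3),

\[
AA^{-1} = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix} A^{11} & A^{12} \\ A^{21} & A^{22} \end{bmatrix} = \begin{bmatrix} I_r & O \\ O & I_s \end{bmatrix}.
\]

Thus, by straight multiplication, the following equations are obtained, where r and s
denote the orders of the identity matrices, and where we assume that $A_{11}$ is
$r \times r$ and $A_{22}$ is $s \times s$:

\[
\begin{aligned}
A_{11}A^{11} + A_{12}A^{21} &= I_r \\
A_{11}A^{12} + A_{12}A^{22} &= O \\
A_{21}A^{11} + A_{22}A^{21} &= O \\
A_{21}A^{12} + A_{22}A^{22} &= I_s.
\end{aligned} \tag{N14.6}
\]

Solving the system in (N14.6), we have the following representations, among other
results:

\[
\begin{aligned}
A^{11} &= [A_{11} - A_{12}A_{22}^{-1}A_{21}]^{-1}, & A_{11}^{-1} &= A^{11} - A^{12}(A^{22})^{-1}A^{21} \\
A^{22} &= [A_{22} - A_{21}A_{11}^{-1}A_{12}]^{-1}, & A_{22}^{-1} &= A^{22} - A^{21}(A^{11})^{-1}A^{12}.
\end{aligned} \tag{N14.7}
\]

Note that the submatrices in the inverse are not the inverses of the corresponding
submatrices in the original matrix. That is, $A^{11} \ne A_{11}^{-1}$ and
$A^{22} \ne A_{22}^{-1}$. From (N14.6), one can also derive formulae for $A^{21}$ and
$A^{12}$ in terms of the submatrices in A and vice versa, which are not listed above.
[This is left as an exercise to the student.]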


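For our illustrative example, it is easily verified that

\[
A^{-1} = \frac{1}{3}\begin{bmatrix} 1 & -\frac{1}{2} & \frac{1}{2} \\ -1 & 2 & 1 \\ 1 & 1 & -1 \end{bmatrix} = \begin{bmatrix} A^{11} & A^{12} \\ A^{21} & A^{22} \end{bmatrix}
\]

where

\[
A^{11} = \left[\tfrac{1}{3}\right], \quad A^{12} = \tfrac{1}{3}\left[-\tfrac{1}{2}, \tfrac{1}{2}\right], \quad
A^{21} = \tfrac{1}{3}\begin{bmatrix} -1 \\ 1 \end{bmatrix}, \quad
A^{22} = \tfrac{1}{3}\begin{bmatrix} 2 & 1 \\ 1 & -1 \end{bmatrix}.
\]

From the computations earlier, we have, for example,

\[ A_{22} - A_{21}A_{11}^{-1}A_{12} = \begin{bmatrix} 1 & 1 \\ 1 & -2 \end{bmatrix} \]

and hence

\[
[A_{22} - A_{21}A_{11}^{-1}A_{12}]^{-1} = \begin{bmatrix} 1 & 1 \\ 1 & -2 \end{bmatrix}^{-1} = \frac{1}{3}\begin{bmatrix} 2 & 1 \\ 1 & -1 \end{bmatrix} = A^{22}.
\]

Thus one result is verified. The remaining verifications are left to the students.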
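The checks in this note can be carried out in exact rational arithmetic. The sketch below is illustrative: it assumes the 3 x 3 matrix A with blocks $A_{11} = [2]$, $A_{12} = [0, 1]$, $A_{21} = (0, 2)'$ and $A_{22}$ with rows (1, 1) and (1, -1), and the helper functions `determinant` and `inverse` are written for this example only.

```python
from fractions import Fraction


def determinant(matrix):
    """Determinant by cofactor expansion along the first row."""
    if len(matrix) == 1:
        return Fraction(matrix[0][0])
    total = Fraction(0)
    for j, entry in enumerate(matrix[0]):
        minor = [row[:j] + row[j + 1:] for row in matrix[1:]]
        total += (-1) ** j * entry * determinant(minor)
    return total


def inverse(matrix):
    """Inverse by Gauss-Jordan elimination; assumes the matrix is non-singular."""
    n = len(matrix)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


# Assumed illustrative matrix, partitioned with A_11 of order 1 and A_22 of order 2.
A = [[2, 0, 1],
     [0, 1, 1],
     [2, 1, -1]]
a11 = A[0][0]
a12 = A[0][1:]
a21 = [row[0] for row in A[1:]]
a22 = [row[1:] for row in A[1:]]

# A_22 - A_21 A_11^{-1} A_12 (A_11 is a scalar here).
schur = [[Fraction(a22[i][j]) - Fraction(a21[i] * a12[j], a11) for j in range(2)]
         for i in range(2)]

det_direct = determinant(A)                       # |A| computed directly
det_blocks = a11 * determinant(schur)             # right side of (N14.4)
lower_right_of_inverse = [row[1:] for row in inverse(A)[1:]]   # the block A^{22}
inverse_of_schur = inverse(schur)                 # right side of (N14.7) for A^{22}
```

Both determinant computations give -6, and `lower_right_of_inverse` coincides with `inverse_of_schur`, in line with (N14.4) and (N14.7).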


Note 14.5 (Correlation coefficient). This is very often a misused concept in applied
statistics. The word “correlation” suggests “relationship”, and hence people
misinterpret it as a measure of relationship between two real scalar random variables,
and the corresponding sample value as measuring the relationship between the pairs of
numbers. There is extensive literature trying to evaluate the strength of the
relationship, “negative relationship”, “positive relationship”, “increasing and
decreasing nature of the relationship”, etc., by studying the correlation. But it is
very easy to show that correlation does not measure relationship at all. Let $\rho$
denote the correlation between two real scalar random variables x and y. Then it is
easy to show that $-1 \le \rho \le 1$. This follows from the Cauchy-Schwarz inequality,
or by using the property $|\cos\theta| \le 1$, or by considering two random variables:

\[ u = \frac{x}{\sigma_1} + \frac{y}{\sigma_2} \quad \text{and} \quad v = \frac{x}{\sigma_1} - \frac{y}{\sigma_2} \]

where x and y are non-degenerate random variables with $\mathrm{Var}(x) = \sigma_1^2 > 0$
and $\mathrm{Var}(y) = \sigma_2^2 > 0$. Take Var(u) and Var(v) and use the fact that they
are non-negative to show that $-1 \le \rho \le 1$. In the Cauchy-Schwarz inequality,
equality is attained, that is, the boundary values $\rho = +1$ and $\rho = -1$ are
attained, if and only if x and y are linearly related, that is, y = a + bx where a and
$b \ne 0$ are constants. This relationship must hold almost surely, meaning that there
could be non-linear relationships but the total probability measure on the non-linear
part must be zero, or there must exist an equivalent linear function with probability
1. This aspect will be illustrated with an example later. Coming back to $\rho$, let us
consider a perfect mathematical relationship between x and y in the form:

\[ y = a + bx + cx^2, \quad c \ne 0. \tag{N14.8} \]


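Since we are computing correlations, without loss of generality, we can omit a.
Further, for convenience let us assume that x has a symmetrical distribution so that
all odd moments disappear. Then $E(x) = 0$, $E(x^3) = 0$ and $E(y) = a + cE(x^2)$. Also
we rule out degenerate variables when computing correlation, and hence it is assumed
that $\mathrm{Var}(x) \ne 0$.

\[
\begin{aligned}
\mathrm{Cov}(x, y) &= E\{x[y - E(y)]\} = E\{x[bx + c(x^2 - E(x^2))]\} \\
&= 0 + b\,\mathrm{Var}(x) + 0 = b\,\mathrm{Var}(x). \\
\mathrm{Var}(y) &= E[bx + c(x^2 - E(x^2))]^2 \\
&= b^2\,\mathrm{Var}(x) + c^2\{E(x^4) - [E(x^2)]^2\}.
\end{aligned}
\]

Then

\[
\begin{aligned}
\rho &= \frac{b\,\mathrm{Var}(x)}{\sqrt{\mathrm{Var}(x)}\,\sqrt{b^2\,\mathrm{Var}(x) + c^2\{E(x^4) - [E(x^2)]^2\}}} \\
&= \frac{b}{|b|\sqrt{1 + \frac{c^2}{b^2}\,\frac{E(x^4) - [E(x^2)]^2}{\mathrm{Var}(x)}}}, \quad \text{for } b \ne 0 \\
&= \begin{cases}
0, & \text{if } b = 0 \\
\dfrac{1}{\sqrt{1 + \frac{c^2}{b^2}\,\frac{E(x^4) - [E(x^2)]^2}{\mathrm{Var}(x)}}}, & \text{if } b > 0 \\
-\dfrac{1}{\sqrt{1 + \frac{c^2}{b^2}\,\frac{E(x^4) - [E(x^2)]^2}{\mathrm{Var}(x)}}}, & \text{if } b < 0.
\end{cases}
\end{aligned} \tag{N14.9}
\]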


Let us take x to be a standard normal variable, that is, x ~ N(0, 1). Then we know that
$E(x) = 0$, $E(x^2) = 1$, $E(x^4) = 3$. In this case,

\[ \rho = \pm\frac{1}{\sqrt{1 + 2\frac{c^2}{b^2}}}, \tag{N14.10} \]

positive if b > 0, negative if b < 0, and zero if b = 0. Suppose that we would like to
have $\rho = 0.01$ and at the same time a perfect mathematical relationship between x
and y, such as the one in (N14.8). Then let b > 0 and let

\[
\frac{1}{\sqrt{1 + 2\frac{c^2}{b^2}}} = 0.01 \ \Rightarrow\ 1 + 2\frac{c^2}{b^2} = \frac{1}{(0.01)^2} = 10000 \ \Rightarrow\ 2\frac{c^2}{b^2} = 9999 \ \Rightarrow\ c^2 = \frac{9999\,b^2}{2}.
\]

Take any b > 0 and c such that $c^2 = \frac{9999\,b^2}{2}$. There are infinitely many
choices. For example, b = 1 gives $c^2 = \frac{9999}{2}$. Similarly, if we want $\rho$ to
be zero, then take b = 0. If we want $\rho$ to be a very high positive number such as
0.999, or a number close to -1 such as -0.99, then also there are infinitely many
choices of b and c such that a perfect mathematical relationship exists between x and y
and at the same time $\rho$ is any small or large quantity between -1 and +1. Thus
$\rho$ is not an indicator of relationship between x and y. Even when there is a
relationship between x and y, other than a linear relationship, $\rho$ can be anything
between -1 and +1, and $\rho = \pm 1$ when and only when there is a linear relationship
almost surely. From the quadratic function that we started with, note that increasing
values of x can go with increasing as well as decreasing values of y when $\rho$ is
positive or negative. Hence that type of interpretation cannot be given to $\rho$
either.
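The dependence of $\rho$ on b and c in (N14.9) is easy to tabulate. The sketch below assumes a symmetric x with E(x) = 0, so that $E(x^2) = \mathrm{Var}(x)$; the function name and its default arguments (the standard normal values Var(x) = 1 and $E(x^4) = 3$) are illustrative choices.

```python
import math


def correlation_quadratic(b, c, var_x=1.0, fourth_moment=3.0):
    """Correlation between x and y = a + b*x + c*x**2 for a symmetric x with E(x) = 0.

    The defaults correspond to a standard normal x: Var(x) = 1 and E(x^4) = 3.
    """
    second_moment = var_x  # E(x^2) equals Var(x) because E(x) = 0
    cov_xy = b * var_x
    var_y = b ** 2 * var_x + c ** 2 * (fourth_moment - second_moment ** 2)
    return cov_xy / math.sqrt(var_x * var_y)


# An exact quadratic relationship with a correlation of only about 0.01.
tiny_rho = correlation_quadratic(1.0, math.sqrt(9999.0 / 2.0))
# b = 0: an exact relationship y = a + c x^2 with zero correlation.
zero_rho = correlation_quadratic(0.0, 1.0)
# Small c relative to b: correlation close to -1 although the relationship is not linear.
near_minus_one = correlation_quadratic(-1.0, 0.01)
```

All three cases describe an exact functional relationship between x and y, yet the correlation ranges from nearly -1 through 0 to about 0.01, which is the point made in the note.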


Example 14.13. Consider the following probability function for x:

\[
f(x) = \begin{cases} \frac{1}{2}, & x = \alpha \\ \frac{1}{2}, & x = -\alpha \\ 0, & \text{elsewhere.} \end{cases}
\]

Compute $\rho$ and check the quadratic relationship between x and y as given in (N14.8).


Solution 14.13. Here, $E(x) = 0$, $E(x^2) = \alpha^2$, $E(x^4) = \alpha^4$. Then $\rho$ in
(N14.9) becomes

\[ \rho = \frac{b}{|b|} = \pm 1 \]

but $c \ne 0$, so that (N14.8) holds. Is there something wrong with the Cauchy-Schwarz
inequality? This is left to the student to think over.


Then what is the correlation coefficient $\rho$? What does it really measure? The
numerator of $\rho$, namely Cov(x, y), measures the joint variation of x and y, or the
scatter of the point (x, y), or the angular dispersion between x and y, corresponding to
the scatter in x, Var(x) = Cov(x, x). Then division by
$\sqrt{\mathrm{Var}(x)\,\mathrm{Var}(y)}$ has the effect of making the covariance
scale-free. Hence $\rho$ really measures the joint variation of x and y, or a type of
scatter in the point (x, y), in the sense that when y = x the covariance becomes Var(x),
which is the square of a measure of scatter in x. Thus $\rho$ is more appropriately
called a scale-free covariance, and this author has suggested, in one of the published
papers, calling $\rho$ the scovariance or scale-free covariance. It should never be
interpreted as measuring relationship or linearity or near linearity or anything like
that. Only the two points $\rho = +1$ and $\rho = -1$ are connected to linearity, and no
other value of $\rho$ is connected to linearity or near linearity. For postulates
defining covariance, or for an axiomatic definition of covariance, the student may
consult the book [14].




Exercises 14.5 


14.5.1. Verify equations (N14.4) and (N14.5) for the following partitioned matrices: 


1-102 

LR REN RE M E 
2. dp» a Ie «ἂν 3 4 
1.0 1 1 
1 0 1 -1 

2 - 0 B B 

B- = 11 12 : B,- [1] 
1 -1 3 1| |B; B 
-1 0 1 4 


14.5.2. For a real scalar random variable $x$, let $E(x) = 0$, $E(x^2) = 4$, $E(x^3) = 6$, $E(x^4) = 24$. Let $y = 50 + x + cx^2$, $c \ne 0$. Compute the correlation between $x$ and $y$ and interpret it for various values of $c$.


14.5.3. Let $x$ be a standard normal variable, that is, $x \sim N(0,1)$. Let $y = a + x + 2x^2 + cx^3$, $c \ne 0$. Compute the correlation between $x$ and $y$ and interpret it for various values of $c$.


14.5.4. Let $x$ be a type-1 beta random variable with the parameters $(\alpha = 2, \beta = 1)$. Let $y = ax^{\delta}$ for some parameters $a$ and $\delta$. Compute the correlation between $x$ and $y$ and interpret it for various values of $a$ and $\delta$.


14.5.5. Repeat Exercise 14.5.4 if $x$ is type-2 beta with the parameters $(\alpha = 1, \beta = 2)$.


14.6 Regression analysis versus correlation analysis 


As mentioned earlier, for studying regression one needs only the conditional distribution of $x_1$ given $x_2,\ldots,x_k$, because the regression of $x_1$ on $x_2,\ldots,x_k$ is the conditional expectation of $x_1$ given $x_2,\ldots,x_k$, that is, $E(x_1|x_2,\ldots,x_k)$. But for correlation analysis we need the joint distribution of all the variables involved. For example, in order to compute the multiple correlation coefficient $\rho_{1.(2\ldots k)}$ we need the joint moments involving all the variables $x_1,x_2,\ldots,x_k$ up to second-order moments. Hence the joint distribution, not just the conditional distribution, is needed. Thus regression analysis and correlation analysis are built on two different premises and should not be mixed up.


14.6.1 Multiple correlation ratio 
In many of our examples, the regression function turned out not to be linear. Let $E(x_1|x_2,\ldots,x_k) = M(x_2,\ldots,x_k)$, which may or may not be a linear function of $x_2,\ldots,x_k$. Consider an arbitrary predictor $g(x_2,\ldots,x_k)$ for predicting $x_1$. To start with, we are assuming that there is a joint distribution of $x_1,x_2,\ldots,x_k$. Let us compute the correlation between $x_1$ and an arbitrary predictor $g(x_2,\ldots,x_k)$ for $x_1$.


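$$\mathrm{Cov}(x_1,g) = E\{[x_1 - E(x_1)][g - E(g)]\} = E\{x_1[g - E(g)]\} \tag{14.23}$$

as explained earlier, since $E\{E(x_1)[g - E(g)]\} = E(x_1)E[g - E(g)] = 0$. Let us convert the expected value in (14.23) into an expectation of the conditional expectation through Lemma 14.1. Then

$$\begin{aligned}
\mathrm{Cov}(x_1,g) &= E\{E[x_1(g - E(g))\,|\,x_2,\ldots,x_k]\} \\
&= E\{(g - E(g))\,E(x_1|x_2,\ldots,x_k)\} \quad \text{since } g \text{ is free of } x_1 \\
&= E\{(g - E(g))\,M(x_2,\ldots,x_k)\} \\
&= \mathrm{Cov}(g,M) \le \sqrt{\mathrm{Var}(g)\,\mathrm{Var}(M)},
\end{aligned}$$

where the last inequality follows from the fact that the correlation $\rho \le 1$. Then the correlation between $x_1$ and an arbitrary predictor, which includes linear predictors also, denoted by $\eta$, is given by the following:

$$\eta = \frac{\mathrm{Cov}(x_1,g)}{\sqrt{\mathrm{Var}(g)}\sqrt{\mathrm{Var}(x_1)}} \le \frac{\sqrt{\mathrm{Var}(g)}\sqrt{\mathrm{Var}(M)}}{\sqrt{\mathrm{Var}(g)}\sqrt{\mathrm{Var}(x_1)}} = \frac{\sqrt{\mathrm{Var}(M)}}{\sqrt{\mathrm{Var}(x_1)}}. \tag{14.24}$$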


Definition 14.4 (Multiple correlation ratio). The maximum correlation between $x_1$ and an arbitrary predictor of $x_1$ by using $x_2,\ldots,x_k$ is given by the following:

$$\max_g \eta = \frac{\sqrt{\mathrm{Var}(M)}}{\sqrt{\mathrm{Var}(x_1)}} = \eta_{1.(2\ldots k)}. \tag{14.25}$$

This maximum correlation, $\sqrt{\mathrm{Var}(M)/\mathrm{Var}(x_1)}$, is defined as the multiple correlation ratio $\eta_{1.(2\ldots k)}$, and the maximum is attained when the arbitrary predictor is the regression of $x_1$ on $x_2,\ldots,x_k$, namely, $E(x_1|x_2,\ldots,x_k) = M(x_2,\ldots,x_k)$.
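The bound in (14.24) can be illustrated by simulation. The following is a minimal sketch under assumed conditions that are not from the text: $x_2 \sim N(0,1)$ and $x_1 = x_2^2 + e$ with $e \sim N(0,1)$ independent of $x_2$, so that $M(x_2) = x_2^2$, $\mathrm{Var}(M) = 2$ and $\mathrm{Var}(x_1) = 3$. The helper `corr` and all variable names are hypothetical.

```python
import math
import random

def corr(u, v):
    """Sample correlation coefficient of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

rng = random.Random(12345)   # fixed seed so the run is reproducible
n = 200_000
x2 = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Assumed model: E(x1 | x2) = M(x2) = x2**2, conditional variance 1.
x1 = [t * t + rng.gauss(0.0, 1.0) for t in x2]
m = [t * t for t in x2]      # the regression function M(x2)

corr_linear = corr(x1, x2)       # predictor g(x2) = x2 (linear)
corr_regression = corr(x1, m)    # predictor g(x2) = M(x2)
eta = math.sqrt(2.0 / 3.0)       # sqrt(Var(M) / Var(x1)) for this model
print(corr_linear, corr_regression, eta)
```

In this assumed model the linear predictor is nearly uncorrelated with $x_1$, while the regression function attains a correlation close to the correlation ratio.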


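Note that when $M(x_2,\ldots,x_k)$ is linear in $x_2,\ldots,x_k$ we have the multiple correlation coefficient given in (14.21). Thus when $g$ is confined to the class of linear predictors,

$$\eta_{1.(2\ldots k)} = \rho_{1.(2\ldots k)} = \sqrt{\frac{\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}}{\sigma_{11}}}. \tag{14.26}$$

Some further properties of $\rho^2$ can be seen easily. Note that from (N14.7),

$$1 - \rho^2_{1.(2\ldots k)} = \frac{\sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}}{\sigma_{11}} = \frac{(\sigma^{11})^{-1}}{\sigma_{11}} = \frac{1}{\sigma_{11}\sigma^{11}}, \tag{14.27}$$

where $\sigma^{11}$ denotes the $(1,1)$-th element of $\Sigma^{-1}$.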




Example 14.14. Check whether the following matrix $\Sigma$ can represent the covariance matrix of $X' = (x_1, x_2, x_3)$, where $x_1, x_2, x_3$ are real scalar random variables. If so, evaluate $\rho_{1.(2,3)}$ and verify (14.27):

$$\Sigma = \begin{bmatrix} 2 & 0 & 1 \\ 0 & 2 & -1 \\ 1 & -1 & 3 \end{bmatrix}.$$
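Solution 14.14. The leading minors of $\Sigma$ are

$$2 > 0, \quad \begin{vmatrix} 2 & 0 \\ 0 & 2 \end{vmatrix} = 4 > 0, \quad |\Sigma| = 8 > 0,$$

and hence $\Sigma = \Sigma' > 0$ (positive definite), so that it can represent the covariance matrix of $X$. For being a covariance matrix one needs only symmetry plus at least positive semi-definiteness. As per our notation,

$$\sigma_{11} = 2, \quad \Sigma_{12} = [0, 1], \quad \Sigma_{22} = \begin{bmatrix} 2 & -1 \\ -1 & 3 \end{bmatrix}, \quad \Sigma_{21} = \Sigma_{12}', \quad \Sigma_{22}^{-1} = \frac{1}{5}\begin{bmatrix} 3 & 1 \\ 1 & 2 \end{bmatrix},$$

and, therefore,

$$\rho^2_{1.(2,3)} = \frac{\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}}{\sigma_{11}} = \frac{1}{(5)(2)}[0,1]\begin{bmatrix} 3 & 1 \\ 1 & 2 \end{bmatrix}\begin{bmatrix} 0 \\ 1 \end{bmatrix} = \frac{1}{5},$$

$$1 - \rho^2_{1.(2,3)} = 1 - \frac{1}{5} = \frac{4}{5},$$

$$\frac{\sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}}{\sigma_{11}} = \frac{1}{2}\left[2 - \frac{2}{5}\right] = \frac{4}{5}, \qquad \frac{1}{\sigma_{11}\sigma^{11}} = \frac{8}{(5)(2)} = \frac{4}{5}.$$

Thus (14.27) is verified.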
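The arithmetic of Example 14.14 can be rechecked with exact fractions. The following is a small sketch in plain Python; the helper names `det2`, `det3` and `sigma_upper_11` are illustrative and not part of the text.

```python
from fractions import Fraction as F

sigma = [[F(2), F(0), F(1)],
         [F(0), F(2), F(-1)],
         [F(1), F(-1), F(3)]]

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def det3(m):
    # Cofactor expansion along the first row.
    return (m[0][0] * det2([[m[1][1], m[1][2]], [m[2][1], m[2][2]]])
            - m[0][1] * det2([[m[1][0], m[1][2]], [m[2][0], m[2][2]]])
            + m[0][2] * det2([[m[1][0], m[1][1]], [m[2][0], m[2][1]]]))

s11 = sigma[0][0]
s12 = [sigma[0][1], sigma[0][2]]
s22 = [[sigma[1][1], sigma[1][2]], [sigma[2][1], sigma[2][2]]]

# Inverse of the 2 x 2 block Sigma_22 via the adjugate formula.
d = det2(s22)
s22_inv = [[s22[1][1] / d, -s22[0][1] / d],
           [-s22[1][0] / d, s22[0][0] / d]]

# Quadratic form Sigma_12 Sigma_22^{-1} Sigma_21 and rho^2_{1.(2,3)}.
quad = sum(s12[i] * s22_inv[i][j] * s12[j] for i in range(2) for j in range(2))
rho_sq = quad / s11

# sigma^{11}: the (1,1) element of Sigma^{-1}, i.e. cofactor / determinant.
sigma_upper_11 = det2(s22) / det3(sigma)

print(rho_sq, 1 - rho_sq, 1 / (s11 * sigma_upper_11))  # 1/5 4/5 4/5
```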


14.6.2 Multiple correlation as a function of the number of regressed variables 


Let $X' = (x_1,\ldots,x_k)$ and let the variance-covariance matrix in $X$ be denoted by $\Sigma = (\sigma_{ij})$. Our general notation for the multiple correlation coefficient is

$$\rho_{1.(2\ldots k)} = \sqrt{\frac{\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}}{\sigma_{11}}}.$$


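For $k = 2$,

$$\rho^2_{1.(2)} = \frac{\sigma_{12}\sigma_{22}^{-1}\sigma_{21}}{\sigma_{11}} = \frac{\sigma_{12}^2}{\sigma_{11}\sigma_{22}} = \rho_{12}^2,$$

so that $\rho_{1.(2)}$ is, in absolute value, the correlation between $x_1$ and $x_2$. For $k = 3$,

$$\rho^2_{1.(2,3)} = \frac{1}{\sigma_{11}}[\sigma_{12}, \sigma_{13}]\begin{bmatrix} \sigma_{22} & \sigma_{23} \\ \sigma_{32} & \sigma_{33} \end{bmatrix}^{-1}\begin{bmatrix} \sigma_{21} \\ \sigma_{31} \end{bmatrix}.$$

Converting everything on the right in terms of the correlations, that is,

$$\sigma_{21} = \sigma_{12} = \rho_{12}\sigma_1\sigma_2, \quad \sigma_{11} = \sigma_1^2, \quad \sigma_{22} = \sigma_2^2, \quad \sigma_{31} = \sigma_{13} = \rho_{13}\sigma_1\sigma_3, \quad \sigma_{33} = \sigma_3^2, \quad \sigma_{23} = \rho_{23}\sigma_2\sigma_3,$$

we have the following:

$$\begin{aligned}
\rho^2_{1.(2,3)} &= \frac{1}{\sigma_1^2}[\rho_{12}\sigma_1\sigma_2, \rho_{13}\sigma_1\sigma_3]\begin{bmatrix} \sigma_2^2 & \rho_{23}\sigma_2\sigma_3 \\ \rho_{23}\sigma_2\sigma_3 & \sigma_3^2 \end{bmatrix}^{-1}\begin{bmatrix} \rho_{12}\sigma_1\sigma_2 \\ \rho_{13}\sigma_1\sigma_3 \end{bmatrix} \\
&= [\rho_{12}\sigma_2, \rho_{13}\sigma_3]\begin{bmatrix} \sigma_2^2 & \rho_{23}\sigma_2\sigma_3 \\ \rho_{23}\sigma_2\sigma_3 & \sigma_3^2 \end{bmatrix}^{-1}\begin{bmatrix} \rho_{12}\sigma_2 \\ \rho_{13}\sigma_3 \end{bmatrix} \\
&= [\rho_{12}, \rho_{13}]\begin{bmatrix} \sigma_2 & 0 \\ 0 & \sigma_3 \end{bmatrix}\left\{\begin{bmatrix} \sigma_2 & 0 \\ 0 & \sigma_3 \end{bmatrix}\begin{bmatrix} 1 & \rho_{23} \\ \rho_{23} & 1 \end{bmatrix}\begin{bmatrix} \sigma_2 & 0 \\ 0 & \sigma_3 \end{bmatrix}\right\}^{-1}\begin{bmatrix} \sigma_2 & 0 \\ 0 & \sigma_3 \end{bmatrix}\begin{bmatrix} \rho_{12} \\ \rho_{13} \end{bmatrix} \\
&= [\rho_{12}, \rho_{13}]\begin{bmatrix} 1 & \rho_{23} \\ \rho_{23} & 1 \end{bmatrix}^{-1}\begin{bmatrix} \rho_{12} \\ \rho_{13} \end{bmatrix} \\
&= \frac{1}{1 - \rho_{23}^2}[\rho_{12}, \rho_{13}]\begin{bmatrix} 1 & -\rho_{23} \\ -\rho_{23} & 1 \end{bmatrix}\begin{bmatrix} \rho_{12} \\ \rho_{13} \end{bmatrix}, \quad 1 - \rho_{23}^2 > 0, \\
&= \frac{\rho_{12}^2 + \rho_{13}^2 - 2\rho_{12}\rho_{13}\rho_{23}}{1 - \rho_{23}^2}.
\end{aligned}$$

Then

$$\begin{aligned}
\rho^2_{1.(2,3)} - \rho_{12}^2 &= \frac{\rho_{12}^2 + \rho_{13}^2 - 2\rho_{12}\rho_{13}\rho_{23}}{1 - \rho_{23}^2} - \rho_{12}^2 \\
&= \frac{\rho_{13}^2 - 2\rho_{12}\rho_{13}\rho_{23} + \rho_{12}^2\rho_{23}^2}{1 - \rho_{23}^2} \\
&= \frac{(\rho_{13} - \rho_{12}\rho_{23})^2}{1 - \rho_{23}^2} \ge 0,
\end{aligned}$$

which is equal to zero only when $\rho_{13} = \rho_{12}\rho_{23}$. Thus, in general,

$$\rho^2_{1.(2,3)} - \rho_{12}^2 \ge 0 \quad \Rightarrow \quad \rho^2_{1.(2,3)} \ge \rho_{12}^2.$$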


In other words, the multiple correlation coefficient increases, or more precisely never decreases, when we incorporate one more variable $x_3$ in the regressed set. It is not difficult to show (left as an exercise to the student) that

$$\rho_{12}^2 \le \rho^2_{1.(2,3)} \le \rho^2_{1.(2,3,4)} \le \cdots \tag{14.28}$$

This indicates that $\rho^2_{1.(2\ldots k)}$ is an increasing (nondecreasing) function of $k$, the number of variables involved in the regressed set. There is a tendency among applied statisticians to use the sample multiple correlation coefficient as an indicator of how good a linear regression function is, in the sense that the bigger the value, the better the model. From (14.28), it is evident that this is a fallacious approach, since the coefficient can be pushed up simply by adding variables. This approach also comes from the tendency to look at the correlation coefficient as a measure of relationship, which again is a fallacious concept.
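The monotonicity in (14.28) can be checked numerically. The sketch below computes $\rho^2_{1.(2\ldots k)} = \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}/\sigma_{11}$ for $k = 2, 3, 4$ with exact fractions. The $4 \times 4$ matrix `S` is a hypothetical positive definite covariance matrix chosen for illustration (it is not taken from the text), and `solve` is a small helper written for this sketch.

```python
from fractions import Fraction

def solve(a, b):
    """Solve a z = b by Gauss-Jordan elimination with exact fractions."""
    n = len(a)
    m = [[Fraction(v) for v in row] + [Fraction(b[i])] for i, row in enumerate(a)]
    for c in range(n):
        p = next(r for r in range(c, n) if m[r][c] != 0)  # nonzero pivot row
        m[c], m[p] = m[p], m[c]
        piv = m[c][c]
        m[c] = [v / piv for v in m[c]]
        for r in range(n):
            if r != c:
                f = m[r][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [row[n] for row in m]

def multiple_corr_sq(s, k):
    """rho^2_{1.(2...k)} = Sigma_12 Sigma_22^{-1} Sigma_21 / sigma_11."""
    s12 = [s[0][j] for j in range(1, k)]
    s22 = [[s[i][j] for j in range(1, k)] for i in range(1, k)]
    z = solve(s22, s12)                  # z = Sigma_22^{-1} Sigma_21
    return sum(u * v for u, v in zip(s12, z)) / Fraction(s[0][0])

# Hypothetical positive definite covariance matrix (illustration only).
S = [[4, 2, 1, 1],
     [2, 3, 1, 0],
     [1, 1, 2, 1],
     [1, 0, 1, 2]]

values = [multiple_corr_sq(S, k) for k in (2, 3, 4)]
print(values)  # [Fraction(1, 3), Fraction(7, 20), Fraction(13, 28)]
```

For this matrix the squared coefficients are 1/3, 7/20 and 13/28, so each added variable raises the value even though nothing has been said about how well a linear model describes the data.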


Exercises 14.6 


14.6.1. Show that $\rho^2_{1.(2,3)} \le \rho^2_{1.(2,3,4)}$ with the standard notation for the multiple correlation coefficient $\rho_{1.(2\ldots k)}$.


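14.6.2. (i) Show that the following matrix $V$ can be a covariance matrix:

$$V = \begin{bmatrix} 1 & 1 & 0 & -1 \\ 1 & 3 & 1 & 0 \\ 0 & 1 & 3 & 1 \\ -1 & 0 & 1 & 2 \end{bmatrix}.$$

(ii) Compute $\rho^2_{1.2}$, $\rho^2_{1.(2,3)}$, $\rho^2_{1.(2,3,4)}$.
(iii) Verify that $\rho^2_{1.2} \le \rho^2_{1.(2,3)} \le \rho^2_{1.(2,3,4)}$.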


14.6.3. Let the conditional density of $x_1$ given $x_2$ be Gaussian with mean value $1 + 2x_2 + x_2^2$ and variance 1, and let the marginal density of $x_2$ be uniform over $[0,1]$. Compute the square of the correlation ratio of $x_1$ to $x_2$, that is,

$$\eta^2_{1.2} = \frac{\mathrm{Var}(M)}{\mathrm{Var}(x_1)}, \quad M = E(x_1|x_2)$$

and

$$\mathrm{Var}(x_1) = \mathrm{Var}[E(x_1|x_2)] + E[\mathrm{Var}(x_1|x_2)].$$




14.6.4. Let the conditional density of $x_1$ given $x_2, x_3$ be exponential with mean value $1 + x_2 + x_3 + x_2x_3$ and let the joint density of $x_2$ and $x_3$ be

$$f(x_2,x_3) = x_2 + x_3, \quad 0 \le x_2 \le 1, \ 0 \le x_3 \le 1$$

and zero elsewhere. Compute the square of the correlation ratio

$$\eta^2_{1.(2,3)} = \frac{\mathrm{Var}(M)}{\mathrm{Var}(x_1)}$$

where

$$M = E(x_1|x_2,x_3)$$

and

$$\mathrm{Var}(x_1) = \mathrm{Var}[E(x_1|x_2,x_3)] + E[\mathrm{Var}(x_1|x_2,x_3)].$$


14.6.5. Let the conditional density of $x_1$ given $x_2, x_3$ be Gaussian, $N(x_2 + x_3 + x_2x_3, 1)$, where $x_2, x_3$ have a joint density as in Exercise 14.6.4. Evaluate the square of the correlation ratio $\eta^2_{1.(2,3)}$.

There are other concepts, such as the partial correlation coefficient and the partial correlation ratio, which fall in the general category of residual analysis in regression problems. We will not go into these aspects here. These will be covered in a module on model building. We will conclude this section with a note on variances and covariances of linear functions of random variables. These were already discussed in Module 6, and they are recalled here for ready reference.


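Note 14.6 (Variances and covariances of linear functions). Let $x_1,\ldots,x_p$ be real scalar variables with $E(x_j) = \mu_j$, $\mathrm{Var}(x_j) = \sigma_{jj}$, $\mathrm{Cov}(x_i,x_j) = \sigma_{ij}$, $i,j = 1,\ldots,p$. Let

$$X = \begin{bmatrix} x_1 \\ \vdots \\ x_p \end{bmatrix}, \quad a = \begin{bmatrix} a_1 \\ \vdots \\ a_p \end{bmatrix}, \quad b = \begin{bmatrix} b_1 \\ \vdots \\ b_p \end{bmatrix}, \quad \Sigma = (\sigma_{ij}) = \begin{bmatrix} \sigma_{11} & \ldots & \sigma_{1p} \\ \sigma_{21} & \ldots & \sigma_{2p} \\ \vdots & & \vdots \\ \sigma_{p1} & \ldots & \sigma_{pp} \end{bmatrix}$$

where $a$ and $b$ are constant vectors, a prime denotes the transpose, $E$ denotes the expected value, and let $\mu' = (\mu_1,\ldots,\mu_p)$, $\mu_j = E(x_j)$, $j = 1,\ldots,p$. As per the definition,

$$\mathrm{Var}(x_j) = E[(x_j - E(x_j))^2],$$
$$\mathrm{Cov}(x_i,x_j) = E[(x_i - E(x_i))(x_j - E(x_j))]; \quad \mathrm{Cov}(x_j,x_j) = \mathrm{Var}(x_j).$$

Then

$$u = a_1x_1 + \cdots + a_px_p = a'X = X'a, \quad v = b_1x_1 + \cdots + b_px_p = b'X = X'b;$$
$$E(a'X) = a'E(X) = a'\mu, \quad E(b'X) = b'E(X) = b'\mu, \quad \mathrm{Var}(a'X) = E[a'X - a'\mu]^2 = E[a'(X - \mu)]^2.$$

From the elementary theory of matrices, it follows that if we have a $1 \times 1$ matrix $c$ then it is a scalar and its transpose is itself, that is, $c' = c$. Being a linear function, $a'(X - \mu)$ is a $1 \times 1$ matrix and hence it is equal to its transpose, which is $(X - \mu)'a$. Hence we may write

$$E[a'(X - \mu)]^2 = E[a'(X - \mu)(X - \mu)'a] = a'E[(X - \mu)(X - \mu)']a$$

since $a$ is a constant and the expected value can be taken inside. But

$$E[(X - \mu)(X - \mu)'] = E\begin{bmatrix} (x_1 - \mu_1)^2 & (x_1 - \mu_1)(x_2 - \mu_2) & \ldots & (x_1 - \mu_1)(x_p - \mu_p) \\ (x_2 - \mu_2)(x_1 - \mu_1) & (x_2 - \mu_2)^2 & \ldots & (x_2 - \mu_2)(x_p - \mu_p) \\ \vdots & \vdots & & \vdots \\ (x_p - \mu_p)(x_1 - \mu_1) & (x_p - \mu_p)(x_2 - \mu_2) & \ldots & (x_p - \mu_p)^2 \end{bmatrix}.$$

Taking expectations inside the matrix, we have

$$E[(X - \mu)(X - \mu)'] = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \ldots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \ldots & \sigma_{2p} \\ \vdots & \vdots & & \vdots \\ \sigma_{p1} & \sigma_{p2} & \ldots & \sigma_{pp} \end{bmatrix} = \Sigma = \text{covariance matrix in } X.$$

Therefore,

$$\mathrm{Var}(a'X) = a'\Sigma a, \quad \mathrm{Var}(b'X) = b'\Sigma b, \quad \mathrm{Cov}(a'X, b'X) = a'\Sigma b = b'\Sigma a$$

since $\Sigma = \Sigma'$.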
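The identities of Note 14.6 hold for any distribution with finite second moments, in particular for the empirical distribution that puts mass $1/n$ on each observed row of a data table. The following sketch checks them exactly on a small hypothetical data set; the data, the coefficient vectors `a` and `b`, and the helper names are assumptions made for illustration.

```python
from fractions import Fraction

# Hypothetical observations on X = (x1, x2, x3)'; each row is one observation.
data = [[1, 2, 0],
        [2, 1, 3],
        [4, 3, 1],
        [3, 5, 2],
        [5, 4, 4]]
a = [1, -2, 3]
b = [2, 0, -1]
p = len(a)

def mean(v):
    return Fraction(sum(v), len(v))

def cov(u, v):
    """Covariance under the empirical distribution (divisor n)."""
    mu, mv = mean(u), mean(v)
    return sum((s - mu) * (t - mv) for s, t in zip(u, v)) / len(u)

cols = [[row[j] for row in data] for j in range(p)]
sigma = [[cov(cols[i], cols[j]) for j in range(p)] for i in range(p)]

def bilinear(u, m, v):
    """u' M v for vectors u, v and a square matrix M."""
    return sum(u[i] * m[i][j] * v[j] for i in range(p) for j in range(p))

u_vals = [sum(ai * xi for ai, xi in zip(a, row)) for row in data]  # a'X
v_vals = [sum(bi * xi for bi, xi in zip(b, row)) for row in data]  # b'X

print(cov(u_vals, u_vals) == bilinear(a, sigma, a))   # Var(a'X) = a' Sigma a
print(cov(u_vals, v_vals) == bilinear(a, sigma, b))   # Cov(a'X, b'X) = a' Sigma b
```

Exact fractions are used so that the two sides can be compared with `==` rather than with a floating-point tolerance.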


Details on variances of linear functions and covariance between two linear func- 
tions are needed to deal with the area of Canonical Correlation Analysis. This is an 
area of predicting one set of variables by using another set of variables. In the regres- 
sion problem that we considered in Sections 14.3-14.5, we were predicting one scalar 
variable by using one or more or one set of other scalar variables. We can general- 
ize this idea and try to predict one set of scalar variables by using another set of scalar 
variables. Since individual variables are contained in linear functions, what is usually 
done is to maximize the correlation between one arbitrary linear function of one set of 
variables and another arbitrary linear function of the other set of variables. By max- 
imizing the correlations, we construct the optimal linear functions, which are called 
pairs of canonical variables. This aspect will be dealt with in detail in the module on 
model building. 




Another useful area is vector and matrix differential operators and their uses in 
multivariate statistical analysis. When estimating or testing hypotheses on the param- 
eters in a multivariate statistical density, these operators will come in handy. Since the 
detailed discussion is beyond the scope of this book, we will just indicate the main 
ideas here for the benefit of curious students. 


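Note 14.7 (Vector and matrix derivatives). Consider the following vector of partial differential operators. Let

$$Y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \quad \frac{\partial}{\partial Y} = \begin{bmatrix} \frac{\partial}{\partial y_1} \\ \vdots \\ \frac{\partial}{\partial y_n} \end{bmatrix}, \quad \frac{\partial f}{\partial Y} = \begin{bmatrix} \frac{\partial f}{\partial y_1} \\ \vdots \\ \frac{\partial f}{\partial y_n} \end{bmatrix}$$

where $f$ is a real-valued scalar function of $Y$. For example,

$$f_1(Y) = a_1y_1 + \cdots + a_ny_n = a'Y, \quad a' = (a_1,\ldots,a_n) \tag{i}$$

is such a function, where $a$ is a constant vector. Here, $f_1$ is a linear function of $Y$, something like

$$2y_1 - y_2 + y_3; \quad y_1 + y_2 + \cdots + y_n; \quad y_1 + 3y_2 - y_3 + 2y_4$$

etc.

$$f_2(y_1,\ldots,y_n) = y_1^2 + y_2^2 + \cdots + y_n^2 \tag{ii}$$

which is the sum of squares or a simple quadratic form or a general quadratic form in its canonical form.

$$f_3(y_1,\ldots,y_n) = Y'AY, \quad A = (a_{ij}) = A' \tag{iii}$$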


is a general quadratic form where A is a known constant matrix, which can be taken 
to be symmetric without loss of generality. A few basic properties that we are going to 
use will be listed here as lemmas. 


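Lemma 14.4.

$$f_1 = a'Y \quad \Rightarrow \quad \frac{\partial f_1}{\partial Y} = a.$$

Note that the partial derivative of the linear function $a'Y = a_1y_1 + \cdots + a_ny_n$ with respect to $y_j$ gives $a_j$ for $j = 1,\ldots,n$, and hence the column vector

$$\frac{\partial f_1}{\partial Y} = \frac{\partial (a'Y)}{\partial Y} = \begin{bmatrix} a_1 \\ \vdots \\ a_n \end{bmatrix} = a.$$

For example, if $a'Y = y_1 - y_2 + 2y_3$ then

$$\frac{\partial}{\partial Y}(a'Y) = \begin{bmatrix} 1 \\ -1 \\ 2 \end{bmatrix}.$$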
2 
Lemma 14.5.

$$f_2 = y_1^2 + \cdots + y_n^2 = Y'Y \quad \Rightarrow \quad \frac{\partial f_2}{\partial Y} = 2Y = 2\begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}.$$

Note that $Y'Y$ is a scalar function of $Y$ whereas $YY'$ is an $n \times n$ matrix, and hence it is a matrix function of $Y$. Note also that when $Y'Y$ is differentiated with respect to $Y$ the $Y'$ disappears and a 2 comes in. We get a column vector because our differential operator is a column vector.


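Lemma 14.6.

$$f_3 = Y'AY, \ A = A' \quad \Rightarrow \quad \frac{\partial f_3}{\partial Y} = 2AY.$$

Here, it can be seen that if $A$ is not taken as symmetric then instead of $2AY$ we will end up with $(A + A')Y$. As an illustration of $f_3$, we can consider

$$f_3 = 2y_1^2 + y_2^2 + y_3^2 - 2y_1y_2 + 5y_2y_3$$

$$= [y_1, y_2, y_3]\begin{bmatrix} 2 & -1 & 0 \\ -1 & 1 & \frac{5}{2} \\ 0 & \frac{5}{2} & 1 \end{bmatrix}\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = Y'AY, \quad A = A'$$

$$= [y_1, y_2, y_3]\begin{bmatrix} 2 & -2 & 0 \\ 0 & 1 & 5 \\ 0 & 0 & 1 \end{bmatrix}\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = Y'BY, \quad B \ne B'.$$

In the first representation, the matrix $A$ is symmetric whereas in the second representation of the same quadratic form the matrix $B$ is not symmetric. By straight differentiation,

$$\frac{\partial f_3}{\partial y_1} = 4y_1 - 2y_2, \quad \frac{\partial f_3}{\partial y_2} = 2y_2 - 2y_1 + 5y_3, \quad \frac{\partial f_3}{\partial y_3} = 2y_3 + 5y_2.$$

Therefore,

$$\frac{\partial f_3}{\partial Y} = \begin{bmatrix} \frac{\partial f_3}{\partial y_1} \\ \frac{\partial f_3}{\partial y_2} \\ \frac{\partial f_3}{\partial y_3} \end{bmatrix} = \begin{bmatrix} 4y_1 - 2y_2 \\ 2y_2 - 2y_1 + 5y_3 \\ 2y_3 + 5y_2 \end{bmatrix} = 2\begin{bmatrix} 2 & -1 & 0 \\ -1 & 1 & \frac{5}{2} \\ 0 & \frac{5}{2} & 1 \end{bmatrix}\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = 2AY.$$

But

$$B + B' = \begin{bmatrix} 2 & -2 & 0 \\ 0 & 1 & 5 \\ 0 & 0 & 1 \end{bmatrix} + \begin{bmatrix} 2 & 0 & 0 \\ -2 & 1 & 0 \\ 0 & 5 & 1 \end{bmatrix} = \begin{bmatrix} 4 & -2 & 0 \\ -2 & 2 & 5 \\ 0 & 5 & 2 \end{bmatrix} = 2\begin{bmatrix} 2 & -1 & 0 \\ -1 & 1 & \frac{5}{2} \\ 0 & \frac{5}{2} & 1 \end{bmatrix} = 2A.$$

Note that when applying Lemma 14.6, the matrix in the quadratic form should be written as a symmetric matrix. This can be done without any loss of generality since for any square matrix $B$, $\frac{1}{2}(B + B')$ is a symmetric matrix and $Y'BY = Y'[\frac{1}{2}(B + B')]Y$. Then when operating with the partial differential operator $\frac{\partial}{\partial Y}$ on $Y'AY$, $A = A'$, the net result is to delete $Y'$ (not $Y$) and premultiply by 2, that is, to write $2AY$.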
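Lemma 14.6 and the illustration above can be checked numerically. The sketch below evaluates the gradient of $Y'BY$ by central differences (which are exact for a quadratic function when exact fractions are used) and compares it with $2AY$ and $(B + B')Y$ at an arbitrarily chosen point. The test point and the helper names are assumptions made for illustration.

```python
from fractions import Fraction as F

A = [[F(2), F(-1), F(0)],
     [F(-1), F(1), F(5, 2)],
     [F(0), F(5, 2), F(1)]]      # symmetric representation
B = [[F(2), F(-2), F(0)],
     [F(0), F(1), F(5)],
     [F(0), F(0), F(1)]]         # non-symmetric representation

def quad(m, y):
    """Quadratic form y' M y."""
    return sum(y[i] * m[i][j] * y[j] for i in range(3) for j in range(3))

def matvec(m, y):
    return [sum(m[i][j] * y[j] for j in range(3)) for i in range(3)]

def central_diff_grad(f, y, h=F(1)):
    """Central differences; exact for a quadratic function."""
    grad = []
    for i in range(len(y)):
        up = list(y)
        up[i] += h
        dn = list(y)
        dn[i] -= h
        grad.append((f(up) - f(dn)) / (2 * h))
    return grad

y = [F(1), F(-2), F(3)]          # arbitrary test point
grad = central_diff_grad(lambda t: quad(B, t), y)
two_ay = [2 * v for v in matvec(A, y)]
b_plus_bt = [[B[i][j] + B[j][i] for j in range(3)] for i in range(3)]

print(grad == two_ay == matvec(b_plus_bt, y))  # True
```

At the point $(1, -2, 3)$ all three computations give the vector $(8, 9, -4)$, in agreement with the partial derivatives obtained by straight differentiation.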

With the help of Note 14.7, one can now evaluate the pairs of canonical variables by using the vector and matrix differential operators. Consider the linear functions $u = a_1x_1 + \cdots + a_mx_m = a'X$ and $v = b_1y_1 + \cdots + b_ny_n = b'Y$, where $a' = (a_1,\ldots,a_m)$, $X' = (x_1,\ldots,x_m)$, $b' = (b_1,\ldots,b_n)$, $Y' = (y_1,\ldots,y_n)$. Then $\mathrm{Var}(u) = a'\Sigma_1a$ and $\mathrm{Var}(v) = b'\Sigma_2b$, where $\Sigma_1$ and $\Sigma_2$ are the covariance matrices in $X$ and $Y$, respectively. Since $a$ and $b$ are arbitrary, $\mathrm{Var}(u)$ and $\mathrm{Var}(v)$ can be arbitrarily large, and hence when maximizing the covariance between $u$ and $v$ we confine ourselves to unit hyperspheres, that is, we impose the conditions $\mathrm{Var}(u) = 1$ and $\mathrm{Var}(v) = 1$. Construction of canonical variables is left as an exercise to the student.


14.7 Estimation of the regression function 


In the earlier sections, we looked at prediction functions and "best predictors", best in the minimum mean square sense. We found that in this case the "best" predictor of a dependent real scalar variable $y$ at preassigned values of the real scalar variables $x_1,\ldots,x_k$ would be the conditional expectation of $y$ given $x_1,\ldots,x_k$. For computing this conditional expectation, so that we have a good predictor function, we need at least the conditional distribution of $y$ given $x_1,\ldots,x_k$. If the joint distribution of $y$ and $x_1,\ldots,x_k$ is available that is also fine, but in a joint distribution there is more information than what we need. In most practical situations, we may have some idea about the conditional expectation but we may not know the conditional distribution. In this case, we cannot explicitly evaluate the regression function analytically. Hence we will consider various scenarios in this section.

The problem that we will consider in this section is the following: it is known that the conditional expectation exists, the conditional distribution is not known, but a general idea is available about the nature of the conditional expectation or the regression function, such as that the regression function is linear in the regressed variables, or of polynomial type, or some other known type of function. Then the procedure is to collect observations on the variables, estimate the regression function and then use this estimated regression function to estimate the value of the dependent variable $y$ at preassigned values of $x_1,\ldots,x_k$, the regressed variables. We will start with $k = 1$, namely one real scalar variable to be predicted by preassigning one independent real scalar variable. Let us start with the linear regression function; here "linear" means linear in the regressed variable or the so-called "independent variable".


14.7.1 Estimation of linear regression of y on x 


This means that the regression of the real scalar random variable $y$ on the real scalar random variable $x$ is believed to be of the form

$$E(y|x) = \beta_0 + \beta_1x \tag{14.29}$$

where $\beta_0$ and $\beta_1$ are unknown because the conditional distribution is not available and the only information available is that the conditional expectation, or the regression of $y$ on $x$, is linear of the type (14.29). In order to estimate $\beta_0$ and $\beta_1$, we will start with the model

$$y = a + bx \tag{14.30}$$

and take observations on the pair $(y,x)$. Let there be $n$ data points $(y_1,x_1),\ldots,(y_n,x_n)$. When $x = x_j$ the value estimated by the model (14.30) is $a + bx_j$, but this estimated value need not be equal to the observed $y_j$ of $y$. Hence the error in estimating $y$ by using $a + bx_j$ is the following, denoting it by $e_j$:

$$e_j = y_j - (a + bx_j). \tag{14.31}$$

When the model is written, the following conventions are used. We write the model as $y = a + bx$ or $y_j = a + bx_j + e_j$, $j = 1,\ldots,n$. The error $e_j$ can be positive for some $j$, negative for some other $j$ and zero for some other $j$. Trying to minimize the errors by minimizing the sum of the errors is therefore not a proper procedure, because the sum of the $e_j$'s may be zero without this meaning that there is no error: the negative and positive values may sum up to zero. Hence a proper quantity to be used is a measure of mathematical "distance" between $y_j$ and $a + bx_j$, or a norm in the $e_j$'s. The sum of squares of the errors, namely $\sum_{j=1}^n e_j^2$, is a squared norm or the square of the Euclidean distance between the $y_j$'s and the $a + bx_j$'s. For a real quantity, if the square is zero then the quantity itself is zero, and if the square attains a minimum then we can say that the distance between the observed $y$ and the estimated $y$, estimated by the model $y = a + bx$, is minimized. For the model in (14.31),

$$\sum_{j=1}^n e_j^2 = \sum_{j=1}^n (y_j - a - bx_j)^2. \tag{14.32}$$

If the model corresponding to (14.31) is a general function $g(a_1,\ldots,a_r,x_1,\ldots,x_k)$, for some $g$ where $a_1,\ldots,a_r$ are unknown constants in the model and $x_1,\ldots,x_k$ are the regressed variables, then the $j$-th observation on $(x_1,\ldots,x_k)$ is denoted by $(x_{1j},x_{2j},\ldots,x_{kj})$, $j = 1,\ldots,n$, and the error sum of squares can be written as

$$\sum_{j=1}^n e_j^2 = \sum_{j=1}^n [y_j - g(a_1,\ldots,a_r,x_{1j},\ldots,x_{kj})]^2. \tag{14.33}$$

The unknown quantities in (14.33) are $a_1,\ldots,a_r$. If the unknown quantities $a_1,\ldots,a_r$, which are also called the parameters in the model, are estimated by minimizing the error sum of squares then the method is known as the method of least squares, introduced originally by Gauss. For our simple model in (14.32), there are two parameters $a, b$ and the minimization is to be done with respect to $a$ and $b$. Observe that in (14.33) the functional form of $g$ in $x_1,\ldots,x_k$ is unimportant, because only some observations on these variables appear in (14.33); what matters is the nature of the parameters, since (14.33) is a function of the unknown quantities $a_1,\ldots,a_r$. Thus when we say that a model is linear it means linear in the unknowns, namely linear in the parameters. If we say that the model is a quadratic model, then it is a quadratic function in the unknown parameters. Note the subtle difference. When we say that we have a linear regression, then we are talking about linearity in the regressed variables where the coefficients are known quantities, available from the conditional distribution. But when we set up a model to estimate a regression function then the unknown quantities in the model are the parameters to be estimated, and hence the degrees go with the degrees of the parameters.

Let us look at the minimization of the sum of squares of the errors in (14.32). This can be done either by using purely algebraic procedures or by using calculus. If we use calculus, then we differentiate partially with respect to the parameters $a$ and $b$, equate to zero and solve the resulting equations:

$$\frac{\partial}{\partial a}\left[\sum_{j=1}^n e_j^2\right] = 0, \quad \frac{\partial}{\partial b}\left[\sum_{j=1}^n e_j^2\right] = 0.$$

$$-2\sum_{j=1}^n (y_j - a - bx_j) = 0 \quad \Rightarrow \quad \sum_{j=1}^n (y_j - \hat{a} - \hat{b}x_j) = 0. \tag{14.34}$$

$$-2\sum_{j=1}^n x_j(y_j - a - bx_j) = 0 \quad \Rightarrow \quad \sum_{j=1}^n x_j(y_j - \hat{a} - \hat{b}x_j) = 0. \tag{14.35}$$

Equations (14.34) and (14.35) do not hold universally for all values of the parameters $a$ and $b$. They hold only at the critical points. The critical points are denoted by $\hat{a}$ and $\hat{b}$, respectively. Taking the sum over all terms and over $j$, one has the following:

$$\sum_{j=1}^n y_j - n\hat{a} - \hat{b}\sum_{j=1}^n x_j = 0 \quad \text{and} \quad \sum_{j=1}^n x_jy_j - \hat{a}\sum_{j=1}^n x_j - \hat{b}\sum_{j=1}^n x_j^2 = 0. \tag{14.36}$$

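In order to simplify the equations in (14.36), we will use the following convenient notations. [These are also standard notations.]

$$\bar{x} = \sum_{j=1}^n \frac{x_j}{n}, \quad \bar{y} = \sum_{j=1}^n \frac{y_j}{n}, \quad s_x^2 = \sum_{j=1}^n \frac{(x_j - \bar{x})^2}{n}, \quad s_y^2 = \sum_{j=1}^n \frac{(y_j - \bar{y})^2}{n}, \quad s_{xy} = \sum_{j=1}^n \frac{(x_j - \bar{x})(y_j - \bar{y})}{n} = s_{yx}.$$

These are the sample means, sample variances and the sample covariance. Under these notations, the first equation in (14.36) reduces to the following, by dividing by $n$:

$$\bar{y} - \hat{a} - \hat{b}\bar{x} = 0.$$

Substituting for $\hat{a}$ in the second equation in (14.36), and dividing by $n$, we have

$$\sum_{j=1}^n \frac{x_jy_j}{n} - [\bar{y} - \hat{b}\bar{x}]\bar{x} - \hat{b}\sum_{j=1}^n \frac{x_j^2}{n} = 0.$$

Therefore,

$$\hat{b} = \frac{\sum_{j=1}^n \frac{x_jy_j}{n} - (\bar{x})(\bar{y})}{\sum_{j=1}^n \frac{x_j^2}{n} - (\bar{x})^2} = \frac{s_{xy}}{s_x^2} = \frac{\sum_{j=1}^n (x_j - \bar{x})(y_j - \bar{y})}{\sum_{j=1}^n (x_j - \bar{x})^2} \quad \text{and} \quad \hat{a} = \bar{y} - \hat{b}\bar{x}. \tag{14.37}$$

The simplifications are done by using the following formulae. For any set of real numbers $(x_1,y_1),\ldots,(x_n,y_n)$,

$$\sum_{j=1}^n (x_j - \bar{x}) = 0, \quad \sum_{j=1}^n (y_j - \bar{y}) = 0, \quad \sum_{j=1}^n (x_j - \bar{x})^2 = \sum_{j=1}^n x_j^2 - n(\bar{x})^2,$$

$$\sum_{j=1}^n (y_j - \bar{y})^2 = \sum_{j=1}^n y_j^2 - n(\bar{y})^2, \quad \sum_{j=1}^n (x_j - \bar{x})(y_j - \bar{y}) = \sum_{j=1}^n x_jy_j - n(\bar{x}\bar{y}).$$

When we used calculus to obtain (14.36), we noted that there is only one critical point $(\hat{a},\hat{b})$ for our problem under consideration. Does this point $(\hat{a},\hat{b})$ in the parameter space $\Omega = \{(a,b) \mid -\infty < a < \infty, -\infty < b < \infty\}$ correspond to a maximum or a minimum? Note that since (14.32) is a sum of squares of real numbers, the maximum of $\sum_{j=1}^n e_j^2$ over all $a$ and $b$ is at $+\infty$. Hence the only critical point $(\hat{a},\hat{b})$ in fact corresponds to a minimum. Thus our estimated regression function, under the assumption that the regression was of the form $E(y|x) = \beta_0 + \beta_1x$, and then estimating it by using the method of least squares, is

$$\hat{y} = \hat{a} + \hat{b}x, \quad \hat{a} = \bar{y} - \hat{b}\bar{x}, \quad \hat{b} = \frac{\sum_{j=1}^n (x_j - \bar{x})(y_j - \bar{y})}{\sum_{j=1}^n (x_j - \bar{x})^2}. \tag{14.38}$$

Hence (14.38) is to be used to estimate the values of $y$ at preassigned values of $x$.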
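As a cross-check on the algebra, the two linear equations in (14.36) can be solved directly and compared with the closed form (14.38). The sketch below does this with Cramer's rule on a small hypothetical data set; the data and the function names are assumptions made for illustration, not part of the text.

```python
def fit_closed_form(x, y):
    """Return (a_hat, b_hat) from the closed form (14.38)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xj - xbar) * (yj - ybar) for xj, yj in zip(x, y))
    sxx = sum((xj - xbar) ** 2 for xj in x)
    b_hat = sxy / sxx
    return ybar - b_hat * xbar, b_hat

def fit_normal_equations(x, y):
    """Solve the two linear equations in (14.36) by Cramer's rule."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xj * xj for xj in x)
    sxy = sum(xj * yj for xj, yj in zip(x, y))
    # n*a + sx*b = sy   and   sx*a + sxx*b = sxy
    det = n * sxx - sx * sx
    return (sy * sxx - sx * sxy) / det, (n * sxy - sx * sy) / det

# Hypothetical data, not taken from the text.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 5.0]
a1, b1 = fit_closed_form(x, y)
a2, b2 = fit_normal_equations(x, y)
residuals = [yj - a1 - b1 * xj for xj, yj in zip(x, y)]
print(a1, b1, a2, b2)
print(sum(residuals), sum(r * xj for r, xj in zip(residuals, x)))
```

The last line evaluates the left sides of (14.34) and (14.35) at the estimates; both should be zero up to rounding error.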


Example 14.15. The growth of a certain plant $y$, growth measured in terms of its height in centimeters, is guessed to have a linear regression on $x$, the time measured in units of weeks. Here, $x = 0$ means the start of the observations, $x = 1$ means the end of the first week, $x = 2$ means the end of the second week and so on. The following observations are made:

x    0     1     2     3     4
y    2.0   4.5   5.5   7.5   10.5

Estimate the regression function and then estimate $y$ at $x = 3.5$ and $x = 7$.


Solution 14.15. As per our notation, $n = 5$,

$$\bar{x} = \frac{0 + 1 + 2 + 3 + 4}{5} = 2, \quad \bar{y} = \frac{2.0 + 4.5 + 5.5 + 7.5 + 10.5}{5} = 6.$$

If you are using a computer with a built-in or loaded program for "regression", then by feeding the observations $(x,y) = (0, 2), (1, 4.5), (2, 5.5), (3, 7.5), (4, 10.5)$ the estimated linear function is readily printed. The same thing is achieved if you have a programmable calculator. If nothing is readily available and you have to do the problem by hand, then for doing the computations fast, form the following table:

   y      x    y-ȳ    x-x̄   (x-x̄)²   (y-ȳ)(x-x̄)    ŷ     y-ŷ    (y-ŷ)²
   2.0    0    -4.0   -2     4         8.0         2     0.0    0.00
   4.5    1    -1.5   -1     1         1.5         4     0.5    0.25
   5.5    2    -0.5    0     0         0.0         6    -0.5    0.25
   7.5    3     1.5    1     1         1.5         8    -0.5    0.25
  10.5    4     4.5    2     4         9.0        10     0.5    0.25
  Sum                       10        20.0                      1.00

Therefore,

$$\hat{b} = \frac{\sum_{j=1}^n (x_j - \bar{x})(y_j - \bar{y})}{\sum_{j=1}^n (x_j - \bar{x})^2} = \frac{20}{10} = 2$$

and

$$\hat{a} = \bar{y} - \hat{b}\bar{x} = 6 - (2)(2) = 2.$$


Note 14.8. Do not round up the estimated values. If an estimated value is 2.1, leave it as 2.1 and do not round it up to 2. Similarly, when averages are taken, do not round up the values of $\bar{x}$ and $\bar{y}$. If you are filling up sacks with coconuts and if 4020 coconuts are filled in 100 sacks, then the average number in each sack is $\frac{4020}{100} = 40.2$ and it is not 40 because $40 \times 100 \ne 4020$.


Hence the estimated regression function is

$$\hat{y} = 2 + 2x.$$

Then the estimated value $\hat{y}$ of $y$ at $x = 3.5$ is given by $\hat{y} = 2 + 2(3.5) = 9$.
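The hand computation of Example 14.15 can be reproduced in a few lines. The following sketch uses only the data of the example and the formulae in (14.38); the variable and function names are illustrative.

```python
x = [0, 1, 2, 3, 4]
y = [2.0, 4.5, 5.5, 7.5, 10.5]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n
sxy = sum((xj - xbar) * (yj - ybar) for xj, yj in zip(x, y))
sxx = sum((xj - xbar) ** 2 for xj in x)

b_hat = sxy / sxx               # slope estimate from (14.38)
a_hat = ybar - b_hat * xbar     # intercept estimate from (14.38)

def predict(x0):
    """Estimated value of y at a preassigned x0."""
    return a_hat + b_hat * x0

fitted = [predict(xj) for xj in x]
ls_minimum = sum((yj - fj) ** 2 for yj, fj in zip(y, fitted))
print(a_hat, b_hat, predict(3.5), ls_minimum)  # 2.0 2.0 9.0 1.0
```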


Note 14.9. The point $x = 7$ is far outside the range of the data points. In the observations, the range of $x$ is only $0 \le x \le 4$, whereas we are asked to estimate $y$ at $x = 7$, which is far out from 4. The estimated function $\hat{y} = 2 + 2x$ can be used for this purpose only if we are 100% sure that the underlying regression is the same function $2 + 2x$ for all values of $x$; then we can use $x = 7$ and obtain the estimated $y$ as $\hat{y} = 2 + 2(7) = 16$. If there is any doubt as to the nature of the function at $x = 7$, then $y$ should not be estimated at a point for $x$ which is far out of the observational range for $x$.


Note 14.10. In the above table for carrying out computations, the last 3 columns, namely $\hat{y}$, $y - \hat{y}$, $(y - \hat{y})^2$, are constructed for making some other calculations later on. The least square minimum is given by the last column sum and in this example it is 1.


Before proceeding further, let us introduce some more technical terms. If we apply calculus to the error sum of squares under the general model in (14.33), we obtain the following equations for evaluating the critical points:

$$\frac{\partial}{\partial a_1}\left[\sum_{j=1}^n e_j^2\right] = 0, \ \ldots, \ \frac{\partial}{\partial a_r}\left[\sum_{j=1}^n e_j^2\right] = 0. \tag{14.39}$$

These minimizing equations in (14.39) under the least square analysis are often called normal equations. This is another awkward technical term in statistics: it has nothing to do with normality or the Gaussian distribution, nor does it mean that other equations have some abnormalities. The nature of the equations in (14.39) will depend upon the nature of the involvement of the parameters $a_1,\ldots,a_r$ with the regressed variables $x_1,\ldots,x_k$.


Note 14.11. What should be the size of $n$, or how many observations are needed to carry out the estimation process? If $g(a_1,\ldots,a_r,x_1,\ldots,x_k)$ is a linear function of the form

$$a_0 + a_1x_1 + \cdots + a_kx_k,$$

then there are $k + 1$ parameters $a_0,\ldots,a_k$ and (14.39) leads to $k + 1$ linear equations in $k + 1$ parameters. This means that, in order to estimate $a_0,\ldots,a_k$, we need at least $k + 1$ observation points if the system of linear equations is consistent, that is, has at least one solution. Hence in this case $n \ge k + 1$. In a non-linear situation, the number of observations needed may be much larger in order to estimate all parameters successfully. Hence the minimum condition needed on $n$ is $n \ge k + 1$, where $k + 1$ is the total number of parameters in a model and the model is linear in these $k + 1$ parameters. Since this is not merely a mathematical problem of solving a system of linear equations, the practical advice is to take $n$ as large as feasible under the given situation so that a wide range of observational points will be involved in the model.


Note 14.12. As a reasonable criterion for estimating $y$ based on $g(a_1,\ldots,a_r,x_1,\ldots,x_k)$, we used the error sum of squares, namely

$$\sum_{j=1}^n e_j^2 = \sum_{j=1}^n [y_j - g(a_1,\ldots,a_r,x_{1j},\ldots,x_{kj})]^2. \tag{14.40}$$

This is the square of a mathematical distance between $y_j$ and $g$. We could have used other measures of distance between $y_j$ and $g$, for example,

$$\sum_{j=1}^n |y_j - g(a_1,\ldots,a_r,x_{1j},\ldots,x_{kj})|. \tag{14.41}$$

Then minimization of this distance and estimation of the parameters $a_1,\ldots,a_r$, thereby estimating the function $g$, is a valid and reasonable procedure. Then why did we choose the squared distance as in (14.40) rather than any other distance such as the one in (14.41)? This is done only for mathematical convenience. For example, if we try to use calculus then differentiation of (14.41) will be rather difficult compared to (14.40).
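The difference between the two criteria can be seen in the simplest possible model, $g = a$ (a constant). For that model, minimizing (14.40) gives the sample mean, while minimizing (14.41) gives a sample median. The following sketch finds both minimizers by a brute-force grid search, precisely because (14.41) is not differentiable everywhere; the data set and the grid are hypothetical choices made for illustration.

```python
data = [1.0, 2.0, 3.0, 4.0, 100.0]   # hypothetical observations

def sum_sq(a):
    """Criterion (14.40) for the constant model g = a."""
    return sum((yj - a) ** 2 for yj in data)

def sum_abs(a):
    """Criterion (14.41) for the constant model g = a."""
    return sum(abs(yj - a) for yj in data)

grid = [i * 0.5 for i in range(0, 201)]   # candidates 0, 0.5, ..., 100
best_sq = min(grid, key=sum_sq)
best_abs = min(grid, key=sum_abs)
print(best_sq, best_abs)   # 22.0 3.0
```

Here 22 is the sample mean and 3 is the sample median, so the two distance measures can lead to quite different estimates when an extreme observation is present.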


14.7.2 Inference on the parameters of a simple linear model 


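Consider the linear model

$$y_j = a + bx_j + e_j, \quad j = 1,\ldots,n$$

where we wish to test hypotheses on the parameters $a$ and $b$ as well as construct confidence intervals for these. This can be done by making some assumptions on the error variables $e_j$, $j = 1,\ldots,n$. Note that the $x_j$'s are constants or preassigned numbers and the only variables on the right are the $e_j$'s, so that the $y_j$'s are also random variables. Let us assume that the $e_j$'s are such that $E(e_j) = 0$, $\mathrm{Var}(e_j) = \sigma^2$, $j = 1,\ldots,n$, and that they are mutually non-correlated. Then we can examine the least square estimators for $a$ and $b$. We have seen that

$$y_j = a + bx_j + e_j \quad \Rightarrow \quad \bar{y} = a + b\bar{x} + \bar{e}$$

$$\hat{b} = \frac{\sum_{j=1}^n (x_j - \bar{x})(y_j - \bar{y})}{\sum_{j=1}^n (x_j - \bar{x})^2} = b + \sum_{j=1}^n d_j(e_j - \bar{e}) = b + \sum_{j=1}^n d_je_j, \quad d_j = \frac{x_j - \bar{x}}{\sum_{i=1}^n (x_i - \bar{x})^2}, \quad \sum_{j=1}^n d_j = 0 \tag{a}$$

$$E(\hat{b}) = b \quad \text{since } E(e_j) = 0, \ E(\bar{e}) = 0$$

$$\mathrm{Var}(\hat{b}) = E[\hat{b} - b]^2 = E\left[\sum_{j=1}^n d_je_j\right]^2 \quad \text{from (a)}$$

$$= \sum_{j=1}^n d_j^2E(e_j^2) + 0 = \sigma^2\sum_{j=1}^n d_j^2 = \frac{\sigma^2}{\sum_{j=1}^n (x_j - \bar{x})^2} \tag{b}$$

$$\hat{a} = \bar{y} - \hat{b}\bar{x} = [a + b\bar{x} + \bar{e}] - \hat{b}\bar{x} = a + \bar{x}[b - \hat{b}] + \bar{e}$$

$$E[\hat{a}] = a \quad \text{since } E(\bar{e}) = 0, \ E(\hat{b}) = b \tag{c}$$

$$\mathrm{Var}(\hat{a}) = E[\hat{a} - a]^2 = E[\bar{x}(b - \hat{b}) + \bar{e}]^2 = \frac{\bar{x}^2\sigma^2}{\sum_{j=1}^n (x_j - \bar{x})^2} + \frac{\sigma^2}{n} = \sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{j=1}^n (x_j - \bar{x})^2}\right]. \tag{d}$$

In (d) the cross-product term vanishes because $\mathrm{Cov}(\hat{b},\bar{e}) = \frac{\sigma^2}{n}\sum_{j=1}^n d_j = 0$.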
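Results (a) and (b) can be illustrated by simulation. The sketch below repeatedly generates data from $y_j = a + bx_j + e_j$ with fixed $x_j$'s and independent errors, and compares the empirical mean and variance of $\hat{b}$ with $b$ and $\sigma^2/\sum_j (x_j - \bar{x})^2$. The parameter values, the normal error distribution, the seed and the number of replications are all assumptions made for this illustration.

```python
import random

x = [0.0, 1.0, 2.0, 3.0, 4.0]             # fixed (preassigned) design points
a_true, b_true, sigma = 1.0, 2.0, 1.0     # hypothetical parameter values
xbar = sum(x) / len(x)
sxx = sum((xj - xbar) ** 2 for xj in x)

def slope_estimate(y):
    """Least square estimate of b for the fixed design x."""
    ybar = sum(y) / len(y)
    return sum((xj - xbar) * (yj - ybar) for xj, yj in zip(x, y)) / sxx

rng = random.Random(2024)                 # fixed seed for reproducibility
reps = 20_000
b_hats = []
for _ in range(reps):
    y = [a_true + b_true * xj + rng.gauss(0.0, sigma) for xj in x]
    b_hats.append(slope_estimate(y))

mean_b = sum(b_hats) / reps
var_b = sum((bh - mean_b) ** 2 for bh in b_hats) / reps
print(mean_b, var_b, sigma ** 2 / sxx)    # last value is the theoretical variance
```

With this design $\sum_j (x_j - \bar{x})^2 = 10$, so the theoretical variance of $\hat{b}$ is 0.1, and the simulated mean and variance should be close to 2 and 0.1, respectively.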


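If we assume further that $e_j \sim N(0,\sigma^2)$, $j = 1,\ldots,n$, that is, iid $N(0,\sigma^2)$, then both $\hat{b}$ and $\hat{a}$ will be normally distributed, being linear functions of normal variables, since the $x_j$'s are constants. In this case,

$$u = \frac{\hat{b} - b}{\sigma\sqrt{\frac{1}{\sum_{j=1}^n (x_j - \bar{x})^2}}} = \sqrt{\sum_{j=1}^n (x_j - \bar{x})^2}\left[\frac{\hat{b} - b}{\sigma}\right] \sim N(0,1)$$

and

$$v = \frac{\hat{a} - a}{\sigma\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{j=1}^n (x_j - \bar{x})^2}}} \sim N(0,1).$$

But usually $\sigma^2$ is unknown. Hence if we replace $\sigma^2$ by an unbiased estimator of $\sigma^2$ then we should get a Student-t statistic. We can show that the least square minimum, denoted by $s^2$, divided by $n - 2$ is an unbiased estimator for $\sigma^2$ for $n > 2$. [We will show this later for the general linear model in Section 14.7.4.] Hence

$$u_1 = \sqrt{\sum_{j=1}^n (x_j - \bar{x})^2}\left[\frac{\hat{b} - b}{\hat{\sigma}}\right] \sim t_{n-2} \tag{14.42}$$

and

$$v_1 = \frac{\hat{a} - a}{\hat{\sigma}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{j=1}^n (x_j - \bar{x})^2}}} \sim t_{n-2} \tag{14.43}$$

where $t_{n-2}$ is a Student-t with $n - 2$ degrees of freedom, and

$$\hat{\sigma}^2 = \frac{\text{least square minimum}}{n - 2} = \frac{s^2}{n - 2}.$$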

Hence we can construct confidence intervals as well as test hypotheses on $a$ and $b$ by using (14.42) and (14.43). A $100(1 - \alpha)\%$ confidence interval for $a$ is

$$\hat{a} \mp t_{n-2,\frac{\alpha}{2}}\,\hat{\sigma}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{j=1}^n (x_j - \bar{x})^2}} \tag{14.44}$$

and that for $b$ is

$$\hat{b} \mp t_{n-2,\frac{\alpha}{2}}\,\hat{\sigma}\sqrt{\frac{1}{\sum_{j=1}^n (x_j - \bar{x})^2}}. \tag{14.45}$$

Details may be seen from Chapter 12, and the illustration is as in Figure 12.3.

The usual hypothesis that we would like to test is $H_0 : b = 0$, or in other words, that there is no effect of $x$ in estimating $y$, or that $x$ is not relevant as far as the prediction of $y$ is concerned. We will consider general hypotheses of the types $H_0 : b = b_0$ (given) and $H_0 : a = a_0$ (given), against the natural alternates. The test criteria will reduce to the following:

$H_0 : b = b_0$ (given), $H_1 : b \ne b_0$; criterion: reject $H_0$ if the observed value of

$$\left|\sqrt{\sum_{j=1}^n (x_j - \bar{x})^2}\left[\frac{\hat{b} - b_0}{\hat{\sigma}}\right]\right| \ge t_{n-2,\frac{\alpha}{2}}. \tag{14.46}$$

For testing $H_0 : a = a_0$ (given), $H_1 : a \ne a_0$; criterion: reject $H_0$ if the observed value of

$$\left|\frac{\hat{a} - a_0}{\hat{\sigma}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{j=1}^n (x_j - \bar{x})^2}}}\right| \ge t_{n-2,\frac{\alpha}{2}} \tag{14.47}$$

where in both cases $\hat{\sigma}^2 = \frac{s^2}{n - 2}$, $n > 2$, with $s^2$ being the least square minimum. If the hypotheses are of the type $H_0 : b \le b_0$ or $H_0 : b \ge b_0$, the procedure is described in the section on testing hypotheses by using the Student-t statistic, Section 13.3.2.


Example 14.16. By using the data and linear model in Example 14.15, construct 95% confidence intervals for $a$ and $b$ and test the hypotheses $H_0 : b = 0$ and $H_0 : a = 3$ at a 5% level of rejection.


Solution 14.16. We have made all the computations in the solution of Example 14.15.
Here, $n = 5$, which means the degrees of freedom $n-2 = 3$. We want 95% confidence
intervals, which means our $\alpha = 0.05$ or $\frac{\alpha}{2} = 0.025$. The tabled value is $t_{3,0.025} = 3.182$.
Observed value of $\hat{b} = 2$ and observed value of $\hat{a} = 2$. From our data, $\sum_{j=1}^{n}(x_j-\bar{x})^2 = 10$
and least square minimum $s^2 = 1$. A 95% confidence interval for $b$ is given by

$$\hat{b} \mp t_{n-2,\frac{\alpha}{2}}\sqrt{\frac{\text{least square minimum}}{n-2}}\sqrt{\frac{1}{\sum_{j=1}^{n}(x_j-\bar{x})^2}} = 2 \mp (3.182)\sqrt{\frac{1}{3}}\sqrt{\frac{1}{10}} \approx [1.42, 2.58].$$

A 95% confidence interval for $a$ is given by

$$\hat{a} \mp t_{n-2,\frac{\alpha}{2}}\,\hat{\sigma}\sqrt{\frac{1}{n}+\frac{\bar{x}^2}{\sum_{j=1}^{n}(x_j-\bar{x})^2}} = 2 \mp (3.182)\sqrt{\frac{1}{3}}\sqrt{\frac{1}{5}+\frac{4}{10}} \approx [0.58, 3.42].$$


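For testing $H_0: b = 0$, we reject if the observed value of

$$\left|\frac{(\hat{b}-0)\sqrt{\sum_{j=1}^{n}(x_j-\bar{x})^2}}{\hat{\sigma}}\right| = |2-0|\,\sqrt{10}\,\sqrt{3} = 10.95 \geq t_{n-2,\frac{\alpha}{2}} = t_{3,0.025} = 3.182.$$

Hence the hypothesis is rejected at the 5% level of rejection. For $H_0: a = 3$, we reject
the hypothesis, at a 5% level of rejection, if the observed value of

$$\left|\frac{\hat{a}-3}{\hat{\sigma}\sqrt{\frac{1}{n}+\frac{\bar{x}^2}{\sum_{j=1}^{n}(x_j-\bar{x})^2}}}\right| \geq t_{3,0.025} = 3.182.$$

Here the observed value is $|2-3|/\big(\sqrt{1/3}\,\sqrt{1/5+4/10}\big) = 2.24 < 3.182$.
Hence this hypothesis is not rejected at the 5% level of rejection.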


Exercises 14.7 


14.7.1. The weight gain y in grams of an experimental cow under a certain diet x in 
kilograms is the following: 


x   0   1   2   3   4   5
y   2   6  10  18  30  40


(i) Fit the model $y = a + bx$ to this data; (ii) compute the least square minimum;
(iii) estimate the weight gain at $x = 3.5$ and at $x = 2.6$.


14.7.2. For the same data in Exercise 14.7.1, fit the model $y = a + bx + cx^2$, $c \neq 0$.
(i) Compute the least square minimum; (ii) by comparing the least square minima in
Exercises 14.7.1 and 14.7.2 check to see which model can be taken as a better fit to the data.


14.7.3. For the data and model in Exercise 14.7.1, construct a 99% confidence interval
for $a$ as well as for $b$, and test, at 1% level of rejection, the hypotheses $H_0: b = 0$ and
$H_0: a = 5$.


14.7.3 Linear regression of $y$ on $x_1, \ldots, x_k$


Suppose that the regression of $y$ on $x_1, \ldots, x_k$ is suspected to be linear in $x_1, \ldots, x_k$, that
is, of the form:

$$E(y \mid x_1, \ldots, x_k) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k.$$




Suppose that $n$ data points $(y_j, x_{1j}, \ldots, x_{kj})$, $j = 1, \ldots, n$ are available. Since the regression
is suspected to be linear and if we want to estimate the regression function, then we
will start with the model

$$y = a_0 + a_1 x_1 + \cdots + a_k x_k.$$

Hence at the $j$-th data point if the error in estimating $y$ is denoted by $e_j$ then

$$e_j = y_j - [a_0 + a_1 x_{1j} + \cdots + a_k x_{kj}], \quad j = 1, \ldots, n$$

and the error sum of squares is then

$$\sum_{j=1}^{n} e_j^2 = \sum_{j=1}^{n} (y_j - a_0 - a_1 x_{1j} - \cdots - a_k x_{kj})^2. \tag{14.48}$$
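We obtain the normal equations by differentiating partially with respect to $a_0$,
$a_1, \ldots, a_k$ and equating to zeros. That is,

$$\frac{\partial}{\partial a_0}\Big[\sum_{j=1}^{n} e_j^2\Big] = 0 \;\Rightarrow\; -2\Big[\sum_{j=1}^{n} y_j - n\hat{a}_0 - \hat{a}_1\sum_{j=1}^{n} x_{1j} - \cdots - \hat{a}_k\sum_{j=1}^{n} x_{kj}\Big] = 0.$$

We can delete $-2$ and divide by $n$. Then

$$\hat{a}_0 = \bar{y} - \hat{a}_1\bar{x}_1 - \cdots - \hat{a}_k\bar{x}_k \tag{14.49}$$

where

$$\bar{y} = \sum_{j=1}^{n}\frac{y_j}{n}, \qquad \bar{x}_i = \sum_{j=1}^{n}\frac{x_{ij}}{n}, \quad i = 1, \ldots, k$$

and $\hat{a}_0$, $\hat{a}_i$, $i = 1, \ldots, k$ indicate the critical point $(\hat{a}_0, \ldots, \hat{a}_k)$ or the point at which the
equations hold. Differentiating with respect to $a_i$, $i = 1, \ldots, k$, we have

$$\frac{\partial}{\partial a_i}\Big[\sum_{j=1}^{n} e_j^2\Big] = 0 \;\Rightarrow\; -2\sum_{j=1}^{n} x_{ij}\big[y_j - \hat{a}_0 - \hat{a}_1 x_{1j} - \cdots - \hat{a}_k x_{kj}\big] = 0$$

$$\Rightarrow\; \sum_{j=1}^{n} x_{ij}y_j = \hat{a}_0\sum_{j=1}^{n} x_{ij} + \hat{a}_1\sum_{j=1}^{n} x_{ij}x_{1j} + \cdots + \hat{a}_k\sum_{j=1}^{n} x_{ij}x_{kj}. \tag{14.50}$$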


Substituting the value of $\hat{a}_0$ from (14.49) into (14.50), rearranging and then dividing
by $n$, we have the following:

$$s_{iy} = \hat{a}_1 s_{1i} + \hat{a}_2 s_{2i} + \cdots + \hat{a}_k s_{ki}, \quad i = 1, \ldots, k \tag{14.51}$$

where

$$s_{ij} = \sum_{m=1}^{n}\frac{(x_{im}-\bar{x}_i)(x_{jm}-\bar{x}_j)}{n} = s_{ji}, \qquad s_{iy} = s_{yi} = \sum_{m=1}^{n}\frac{(y_m-\bar{y})(x_{im}-\bar{x}_i)}{n}$$


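or the corresponding sample variances and covariances. If we do not wish to substitute
for $\hat{a}_0$ from (14.49) into (14.50), then we may solve (14.49) and (14.50) together to obtain
a solution for $(\hat{a}_0, \hat{a}_1, \ldots, \hat{a}_k)$. But from (14.51) we get only $(\hat{a}_1, \ldots, \hat{a}_k)$ and then this has
to be used in (14.49) to obtain $\hat{a}_0$. From (14.51), we have the following matrix equation:

$$\begin{aligned}
s_{1y} &= \hat{a}_1 s_{11} + \hat{a}_2 s_{12} + \cdots + \hat{a}_k s_{1k}\\
s_{2y} &= \hat{a}_1 s_{21} + \hat{a}_2 s_{22} + \cdots + \hat{a}_k s_{2k}\\
&\;\;\vdots\\
s_{ky} &= \hat{a}_1 s_{k1} + \hat{a}_2 s_{k2} + \cdots + \hat{a}_k s_{kk}
\end{aligned}$$

or

$$S_y = S\hat{a} \tag{14.52}$$

where

$$S_y = \begin{bmatrix} s_{1y}\\ \vdots\\ s_{ky}\end{bmatrix}, \quad \hat{a} = \begin{bmatrix}\hat{a}_1\\ \vdots\\ \hat{a}_k\end{bmatrix}, \quad S = (s_{ij}), \quad s_{ij} = \sum_{m=1}^{n}\frac{(x_{im}-\bar{x}_i)(x_{jm}-\bar{x}_j)}{n}.$$

From (14.52),

$$\hat{a} = \begin{bmatrix}\hat{a}_1\\ \vdots\\ \hat{a}_k\end{bmatrix} = S^{-1}S_y \quad \text{for } |S| \neq 0. \tag{14.53}$$

From (14.53),

$$\hat{a}_0 = \bar{y} - \hat{a}'\bar{x} = \bar{y} - S_y'S^{-1}\bar{x}, \qquad \bar{x} = \begin{bmatrix}\bar{x}_1\\ \vdots\\ \bar{x}_k\end{bmatrix}. \tag{14.54}$$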


Note 14.13. When observations on $(x_1, \ldots, x_k)$ are involved, even if we take extreme
care, near singularity may sometimes occur in $S$. In general, one has to solve the
system of linear equations in (14.52), for which many standard methods are available
whether the coefficient matrix, in our case $S$, is non-singular or not. In a regression-type
model, as in our case above, the points $(x_{1j}, x_{2j}, \ldots, x_{kj})$, $j = 1, \ldots, n$ are preassigned,
and hence while preassigning, make sure that data points for $(x_1, \ldots, x_k)$
which are linear functions of other points which are already included are not taken
as new data points. If linear functions are taken, then this will result in $S$ being
singular.




Example 14.17. In a feeding experiment on cows, it is suspected that the increase in
weight $y$ has a linear regression on the amount of green fodder $x_1$ and the amount of
marketed cattle feed $x_2$ consumed. The following observations are available; all
observations on $x_1$ and $x_2$ are in kilograms and the observations on $y$ are in grams:

$x_1$   1.0   1.5   2.0   1.0   2.5   1.0
$x_2$   2.0   1.5   1.0   1.5   2.0   4.0
$y$     5.0   5.0   6.0   4.5   7.5   8.0

Construct the estimating function and then estimate $y$ at the points $(x_1, x_2) =
(1,0), (1,3), (5,8)$.


Solution 14.17. As per our notation $n = 6$,

$$\bar{x}_1 = \frac{(1.0+1.5+2.0+1.0+2.5+1.0)}{6} = 1.5,$$
$$\bar{x}_2 = \frac{(2.0+1.5+1.0+1.5+2.0+4.0)}{6} = 2,$$
$$\bar{y} = \frac{(5.0+5.0+6.0+4.5+7.5+8.0)}{6} = 6.$$


Again, if we are using a computer or programmable calculator, then regression programs
are built in and the results are instantly available by feeding in the
data. For the calculations by hand, the following table will be handy:

$y$    $x_1$   $x_2$   $y-\bar{y}$   $x_1-\bar{x}_1$   $x_2-\bar{x}_2$   $(y-\bar{y})(x_1-\bar{x}_1)$
5.0    1.0     2.0     -1.0          -0.5              0.0               0.50
5.0    1.5     1.5     -1.0           0.0             -0.5               0.00
6.0    2.0     1.0      0.0           0.5             -1.0               0.00
4.5    1.0     1.5     -1.5          -0.5             -0.5               0.75
7.5    2.5     2.0      1.5           1.0              0.0               1.50
8.0    1.0     4.0      2.0          -0.5              2.0              -1.00
Sum                                                                      1.75

$(y-\bar{y})(x_2-\bar{x}_2)$   $(x_1-\bar{x}_1)^2$   $(x_2-\bar{x}_2)^2$   $(x_1-\bar{x}_1)(x_2-\bar{x}_2)$
0.00                           0.25                  0.00                   0.00
0.50                           0.00                  0.25                   0.00
0.00                           0.25                  1.00                  -0.50
0.75                           0.25                  0.25                   0.25
0.00                           1.00                  0.00                   0.00
4.00                           0.25                  4.00                  -1.00
Sum: 5.25                      2.00                  5.50                  -1.25


The equations corresponding to (14.52), without the dividing factor $n = 6$, are the
following:

$$\begin{aligned}
1.75 &= 2\hat{a}_1 - 1.25\hat{a}_2\\
5.25 &= -1.25\hat{a}_1 + 5.5\hat{a}_2
\end{aligned} \;\Rightarrow\; \hat{a}_1 = 1.72,\ \hat{a}_2 = 1.34.$$

Then

$$\hat{a}_0 = \bar{y} - \hat{a}_1\bar{x}_1 - \hat{a}_2\bar{x}_2 = 6 - (1.72)(1.5) - (1.34)(2) = 0.74.$$

Hence the estimated function is given by

$$\hat{y} = 0.74 + 1.72x_1 + 1.34x_2.$$

The estimated value of $y$ at $(x_1, x_2) = (1,3)$ is $\hat{y} = 6.48$. The point $(x_1, x_2) = (5,8)$ is too far
out of the observational range, and hence we may estimate $y$ only if we are sure that
the conditional expectation is linear for all possible $(x_1, x_2)$. If the regression is sure to
hold for all $(x_1, x_2)$, then the estimated $y$ at $(x_1, x_2) = (5,8)$ is

$$\hat{y} = 0.74 + (1.72)(5) + (1.34)(8) = 20.06.$$

The estimated values at the observed points are computed in the same way. For example,

at $(x_1, x_2) = (1, 2)$, $\hat{y} = 5.14$; at $(x_1, x_2) = (1.5, 1.5)$, $\hat{y} = 5.33$;
at $(x_1, x_2) = (2, 1)$, $\hat{y} = 5.52$; at $(x_1, x_2) = (1, 4)$, $\hat{y} = 7.82$.


Hence we can construct the following table:

$y$    $\hat{y}$   $y-\hat{y}$   $(y-\hat{y})^2$
5.0    5.14        -0.14         0.0196
5.0    5.33        -0.33         0.1089
6.0    5.52         0.48         0.2304
4.5    4.47         0.03         0.0009
7.5    7.72        -0.22         0.0484
8.0    7.82         0.18         0.0324
Sum                              0.4406


An estimate of the error sum of squares as well as the least square minimum is 0.4406 
in this model. 
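The hand computation of Example 14.17 can be cross-checked with a short program. The following is a sketch in plain Python, not part of the original text; the helper names `mean`, `cross` and `predict` are introduced here for illustration. It solves the two normal equations by Cramer's rule and keeps the coefficients unrounded, so the third decimal of the results can differ slightly from the values obtained by hand with coefficients rounded to two decimals.

```python
# Data of Example 14.17 (x1, x2 in kilograms; y in grams).
x1 = [1.0, 1.5, 2.0, 1.0, 2.5, 1.0]
x2 = [2.0, 1.5, 1.0, 1.5, 2.0, 4.0]
y = [5.0, 5.0, 6.0, 4.5, 7.5, 8.0]

def mean(v):
    return sum(v) / len(v)

def cross(u, v):
    """Sum of products of deviations from the means (no division by n)."""
    ub, vb = mean(u), mean(v)
    return sum((ui - ub) * (vi - vb) for ui, vi in zip(u, v))

s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
s1y, s2y = cross(x1, y), cross(x2, y)

# Solve  s1y = a1*s11 + a2*s12,  s2y = a1*s12 + a2*s22  by Cramer's rule.
det = s11 * s22 - s12 * s12
a1 = (s1y * s22 - s12 * s2y) / det
a2 = (s11 * s2y - s12 * s1y) / det
a0 = mean(y) - a1 * mean(x1) - a2 * mean(x2)   # equation (14.49)

def predict(u1, u2):
    """Estimated y at the point (x1, x2) = (u1, u2)."""
    return a0 + a1 * u1 + a2 * u2

# Least square minimum: sum of squared residuals at the fitted coefficients.
lsm = sum((yj - predict(u1, u2)) ** 2 for yj, u1, u2 in zip(y, x1, x2))

print(f"a0 = {a0:.4f}, a1 = {a1:.4f}, a2 = {a2:.4f}, least square minimum = {lsm:.4f}")
```

The same `cross` helper works for any number of regressors; only the 2 x 2 solve would have to be replaced by a general linear solver.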


Exercises 14.7 


14.7.4. The yield $y$ of corn in a test plot is expected to be a linear function of $x_1$ =
amount of water supplied, in addition to the normal rain, and $x_2$ = amount of organic
fertilizer (cow dung), in addition to the fertility of the soil. The following is the data
available:




$x_1$   0   0   1   2     1.5   2.5   3
$x_2$   0   1   1   1.5   2     2     3
$y$     2   2   5   8     7     9     10


(i) Fit a linear model $y = a_0 + a_1x_1 + a_2x_2$ by the method of least squares.
(ii) Estimate $y$ at the points

$$(x_1, x_2) = (0.5, 1.5),\ (3.5, 2.5).$$

(iii) Compute the least square minimum.


14.7.4 General linear model 


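If we use matrix notation, then the material in Section 14.7.3 can be simplified and
written in a compact form. Consider a general linear model of the following type:
Suppose that the real scalar variable $y$ is to be estimated by using a linear function of
$x_1, \ldots, x_p$. Then we may write the model as

$$y_j = a_0 + a_1x_{1j} + a_2x_{2j} + \cdots + a_px_{pj} + e_j, \quad j = 1, \ldots, n. \tag{14.55}$$

This can be written as

$$Y = X\beta + e, \quad
Y = \begin{bmatrix} y_1\\ \vdots\\ y_n\end{bmatrix}, \quad
X = \begin{bmatrix} 1 & x_{11} & \cdots & x_{p1}\\ \vdots & \vdots & & \vdots\\ 1 & x_{1n} & \cdots & x_{pn}\end{bmatrix}, \quad
\beta = \begin{bmatrix} a_0\\ \vdots\\ a_p\end{bmatrix}, \quad
e = \begin{bmatrix} e_1\\ \vdots\\ e_n\end{bmatrix}. \tag{14.56}$$

Then the error sum of squares is given by

$$e'e = (Y - X\beta)'(Y - X\beta). \tag{14.57}$$

Minimization by using vector derivative (see Note 14.7) gives

$$\frac{\partial}{\partial\beta}\,e'e = O \;\Rightarrow\; -2X'(Y - X\hat{\beta}) = O \tag{14.58}$$

$$\Rightarrow\; \hat{\beta} = (X'X)^{-1}X'Y \quad \text{for } |X'X| \neq 0. \tag{14.59}$$

Since $X$ is under our control (these are preassigned values), we can assume $X'X$ to be
non-singular in regression-type linear models. Such will not be the situation in design
models where $X$ is determined by the design used. This aspect will be discussed in the
next chapter. The least square minimum, again denoted by $s^2$, is given by

$$\begin{aligned}
s^2 &= (Y - X\hat{\beta})'(Y - X\hat{\beta}) = Y'(Y - X\hat{\beta}) \quad \text{due to (14.58)}\\
&= Y'Y - Y'X(X'X)^{-1}X'Y = Y'[I - X(X'X)^{-1}X']Y = Y'[I - B]Y
\end{aligned}$$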




where

$$B = X(X'X)^{-1}X' = B^2.$$

This shows that $B$ is idempotent and of

$$\text{rank} = \text{tr}[X(X'X)^{-1}X'] = \text{tr}[(X'X)^{-1}X'X] = \text{tr}[I_{p+1}] = p+1. \tag{14.60}$$

Hence the rank of $I - X(X'X)^{-1}X'$ is $n-(p+1)$. Note further that $I - B = (I-B)^2$, $(I-B)B = O$
and hence from Section 10.5 of Chapter 10 we have that $u = s^2 = Y'[I - X(X'X)^{-1}X']Y$
and $v = Y'X(X'X)^{-1}X'Y$ are independently distributed when $e \sim N_n(O, \sigma^2I_n)$ or
$Y \sim N_n(X\beta, \sigma^2I_n)$. Further,

$$\frac{1}{\sigma^2}Y'[I - X(X'X)^{-1}X']Y = \frac{1}{\sigma^2}(Y - X\beta)'[I - X(X'X)^{-1}X'](Y - X\beta) \sim \chi^2_{n-(p+1)}$$

where $\chi^2_{\nu}$ denotes a central chi-square with $\nu$ degrees of freedom, and

$$\frac{1}{\sigma^2}Y'X(X'X)^{-1}X'Y \sim \chi^2_{p+1}(\lambda)$$

where $\chi^2_{p+1}(\lambda)$ is a non-central chi-square with $p+1$ degrees of freedom and
non-centrality parameter $\lambda = \frac{1}{2\sigma^2}\beta'(X'X)\beta$. [See the discussion of non-central chi-square
in Example 10.9 of Chapter 10.] Hence we can test the hypothesis that $\beta = O$ (a null
vector), by using an F-statistic, under the hypothesis:

$$F_{p+1,\,n-(p+1)} = \frac{v/(p+1)}{u/(n-(p+1))}, \qquad v = Y'X(X'X)^{-1}X'Y, \quad u = Y'[I - X(X'X)^{-1}X']Y \tag{14.61}$$

and we reject the hypothesis for large values of the observed F-statistic.
Note that the individual parameters are estimated by the equation

$$\hat{\beta} = (X'X)^{-1}X'Y \tag{14.62}$$

where the various column elements of the right side give the individual estimates. What
is the variance-covariance matrix of this vector of estimators $\hat{\beta}$? Let us denote the
covariance matrix by $\text{Cov}(\hat{\beta})$. Then from the definition of covariance matrix,

$$\text{Cov}(\hat{\beta}) = (X'X)^{-1}X'\,\text{Cov}(Y)\,X(X'X)^{-1} = (X'X)^{-1}\sigma^2I(X'X)(X'X)^{-1} = \sigma^2(X'X)^{-1}. \tag{14.63}$$
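The matrix formulas (14.59) and (14.63) translate directly into code. Below is a minimal sketch in plain Python, not part of the original text; the function names `transpose`, `matmul`, `solve` and `fit` are assumptions introduced here. Instead of forming $(X'X)^{-1}$ and multiplying, it solves the normal equations $X'X\hat{\beta} = X'Y$ by Gauss-Jordan elimination, which is a common numerical substitute for the explicit inverse. The data are those of Example 14.17.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def solve(a, b):
    """Solve a z = b by Gauss-Jordan elimination with partial pivoting.

    a is a square list of lists, b a list of right-hand-side rows.
    Raises ValueError when a is (numerically) singular.
    """
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def fit(x_rows, y):
    """Return (beta_hat, s2, xtx_inv) for the model Y = X beta + e."""
    X = [[1.0] + list(r) for r in x_rows]             # prepend the column of ones
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    XtY = matmul(Xt, [[v] for v in y])
    beta = [row[0] for row in solve(XtX, XtY)]         # solves X'X beta = X'Y
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in X]
    s2 = sum((yj - fj) ** 2 for yj, fj in zip(y, fitted))   # least square minimum
    m = len(XtX)
    identity = [[float(i == j) for j in range(m)] for i in range(m)]
    xtx_inv = solve(XtX, identity)                     # Cov(beta_hat) = sigma^2 * xtx_inv
    return beta, s2, xtx_inv

# Data of Example 14.17: rows are (x1, x2).
x_rows = [(1.0, 2.0), (1.5, 1.5), (2.0, 1.0), (1.0, 1.5), (2.5, 2.0), (1.0, 4.0)]
y = [5.0, 5.0, 6.0, 4.5, 7.5, 8.0]
beta_hat, s2, xtx_inv = fit(x_rows, y)
sigma2_hat = s2 / (len(y) - len(beta_hat))             # s^2 / (n - (p + 1))
```

Multiplying `sigma2_hat` by the diagonal elements of `xtx_inv` gives estimated variances of the individual coefficient estimators, which is what a Student-t interval for a single coefficient needs.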




How do we construct confidence intervals for the parameter $a_j$ in $\beta$ and test hypotheses
on $a_j$, $j = 0, 1, \ldots, p$? Let $\hat{a}_j$ be the $(j+1)$-th element in the right side column in (14.62)
and let $b_{j+1,j+1}$ be the $(j+1, j+1)$-th diagonal element in $(X'X)^{-1}$, $j = 0, 1, \ldots, p$. Then

$$\frac{\hat{a}_j - a_j}{\hat{\sigma}\sqrt{b_{j+1,j+1}}} \sim t_{n-(p+1)} \tag{14.64}$$

or a Student-t with $n-(p+1)$ degrees of freedom, where

$$\hat{\sigma}^2 = \frac{s^2}{n-(p+1)} = \frac{\text{least square minimum}}{n-(p+1)}.$$

Then use this Student-t to construct confidence intervals and test hypotheses. Since it
will take up too much space, we will not do a numerical example here.
Sometimes we may want to separate $a_0$ from the parameters $a_1, \ldots, a_p$. In this case,
we modify the model as

$$y_j - \bar{y} = a_1(x_{1j} - \bar{x}_1) + a_2(x_{2j} - \bar{x}_2) + \cdots + a_p(x_{pj} - \bar{x}_p) + e_j - \bar{e}. \tag{14.65}$$


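Now, proceed exactly as before. Let the resulting matrices be denoted by $\tilde{Y}$, $\tilde{X}$, $\tilde{\beta}$, $\tilde{e}$
where

$$\tilde{Y} = \begin{bmatrix} y_1 - \bar{y}\\ \vdots\\ y_n - \bar{y}\end{bmatrix}, \quad
\tilde{\beta} = \begin{bmatrix} a_1\\ \vdots\\ a_p\end{bmatrix}, \quad
\tilde{e} = \begin{bmatrix} e_1 - \bar{e}\\ \vdots\\ e_n - \bar{e}\end{bmatrix}, \quad
\tilde{X} = \begin{bmatrix} x_{11} - \bar{x}_1 & \cdots & x_{p1} - \bar{x}_p\\ \vdots & & \vdots\\ x_{1n} - \bar{x}_1 & \cdots & x_{pn} - \bar{x}_p\end{bmatrix}. \tag{14.66}$$

Then the estimator, covariance matrix of the estimator, least square minimum, etc. are
given by the following, where the estimators are denoted by a star:

$$\hat{\beta}_* = (\tilde{X}'\tilde{X})^{-1}\tilde{X}'\tilde{Y}$$

$$\text{Cov}(\hat{\beta}_*) = \sigma^2(\tilde{X}'\tilde{X})^{-1}$$

$$\hat{\sigma}_*^2 = \frac{s_*^2}{n-1-p} = \frac{\text{least square minimum}}{n-1-p} = \frac{\tilde{Y}'[I - \tilde{X}(\tilde{X}'\tilde{X})^{-1}\tilde{X}']\tilde{Y}}{n-1-p}.$$

Under normality assumption for $e \sim N_n(O, \sigma^2I_n)$ we have

$$u = \frac{1}{\sigma^2}\tilde{Y}'[I - \tilde{X}(\tilde{X}'\tilde{X})^{-1}\tilde{X}']\tilde{Y} \sim \chi^2_{n-1-p}, \qquad
v = \frac{1}{\sigma^2}\tilde{Y}'\tilde{X}(\tilde{X}'\tilde{X})^{-1}\tilde{X}'\tilde{Y} \sim \chi^2_{p}(\lambda_1)$$

where $u$ and $v$ are independently distributed, and the non-centrality parameter is
$\lambda_1 = \frac{1}{2\sigma^2}\tilde{\beta}'(\tilde{X}'\tilde{X})\tilde{\beta}$.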




Note 14.14. Models are classified as linear or non-linear depending upon the
linearity or non-linearity of the parameters in the model. All linear models can be
handled by the procedure in Section 14.7.4 by renaming the coefficients of the
parameters. For example, (1) $y = a_0 + a_1x + a_2x^2 + \cdots + a_kx^k$ (write $x = u_1$, $x^2 = u_2$, ...,
$x^k = u_k$ and apply the techniques in Section 14.7.4), (2) $y = a_0 + a_1x_1x_2 + a_2x_1^2 + a_3x_2^2$
(write $u_1 = x_1x_2$, $u_2 = x_1^2$, $u_3 = x_2^2$), (3) $y = a_0 + a_1x_1x_2 + a_2x_2x_3 + a_3x_1^2x_2$ (write $u_1 = x_1x_2$,
$u_2 = x_2x_3$, $u_3 = x_1^2x_2$), are all linear models, whereas (4) $y = ab^x$, (5) $y = 3^{a+bx}$ are
non-linear models. There are non-linear least square techniques available for handling
non-linear models. That area is known as non-linear least square analysis.


Note 14.15. In some books, the student may find a statement asking
to take logarithms and use linear least square analysis for handling a model of the
type $y = ab^x$. This is wrong unless the error $e$ is always positive and enters into the
model as a product, or unless the model is of the form $y_j = ab^{x_j}e_j$ with $e_j > 0$, which
is a very unlikely scenario. We are dealing with real variables here and the
logarithm cannot be taken when $e_j$ is negative or zero. If $ab^x$ is taken to predict $y$,
then the model should be constructed as $y_j = ab^{x_j} + e_j$, $j = 1, \ldots, n$. Then the error
sum of squares will become

$$\sum_{j=1}^{n} e_j^2 = \sum_{j=1}^{n}(y_j - ab^{x_j})^2 \tag{a}$$

and it will be difficult to handle this situation. An analytic solution is not
available for the normal equations coming out of equation (a). There are
several methods available for handling non-linear least square problems. A
non-linear least square analysis is to be conducted for analyzing models such as $y = ab^x$.
The most frequently used non-linear least square analysis technique is Marquardt's
method. A very efficient algorithm for non-linear least squares, which rarely
fails, may be found in [9].
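To make the contrast with the logarithm shortcut concrete, the error sum of squares (a) can be minimized numerically. The sketch below is in plain Python and is not part of the original text; it is not Marquardt's method and not the algorithm of [9]. It uses a simpler device that works for this particular model: for a fixed $b$, the best $a$ is an ordinary linear least square estimate, so only a one-dimensional search over $b$ remains, done here by golden-section search. The function names, the search interval and the data are assumptions chosen for illustration, and the search presumes a single minimum in $b$ on the chosen interval.

```python
import math

def profile_sse(b, xs, ys):
    """For fixed b, the best a is linear least squares; return (sse, a)."""
    basis = [b ** x for x in xs]
    a = sum(y * u for y, u in zip(ys, basis)) / sum(u * u for u in basis)
    sse = sum((y - a * u) ** 2 for y, u in zip(ys, basis))
    return sse, a

def fit_ab_power(xs, ys, lo, hi, tol=1e-10):
    """Minimise sum (y_j - a*b**x_j)**2 over a and over b in [lo, hi].

    Golden-section search on b; assumes the profiled error sum of squares
    has a single minimum inside [lo, hi].
    """
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    c = hi - ratio * (hi - lo)
    d = lo + ratio * (hi - lo)
    while hi - lo > tol:
        if profile_sse(c, xs, ys)[0] < profile_sse(d, xs, ys)[0]:
            hi = d          # minimum lies in [lo, d]
        else:
            lo = c          # minimum lies in [c, hi]
        c = hi - ratio * (hi - lo)
        d = lo + ratio * (hi - lo)
    b = (lo + hi) / 2.0
    sse, a = profile_sse(b, xs, ys)
    return a, b, sse

# Hypothetical data with additive errors around y = 2 * 1.5**x.
xs = [0, 1, 2, 3, 4]
ys = [2.1, 2.9, 4.6, 6.7, 10.1]
a_hat, b_hat, sse = fit_ab_power(xs, ys, lo=1.0, hi=3.0)
print(f"a = {a_hat:.4f}, b = {b_hat:.4f}, error sum of squares = {sse:.5f}")
```

Because the errors enter additively, nothing in this computation requires $y_j$ or $e_j$ to be positive, which is exactly the point of the note.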


Exercises 14.7 


14.7.5. For the linear model and data in Exercise 14.7.4, construct 95% confidence
intervals for (1) $a_1$; (2) $a_2$, and test the hypotheses, at 5% level of rejection, (3) $H_0: a_1 = 0$;
(4) $H_0: a_2 = 0$; (5) $H_0: (a_1, a_2) = (0, 0)$.


15 Design of experiments and analysis of variance 


15.1 Introduction 


In Chapter 14, we have looked into regression and regression-type models in the area of 
model building. Here, we consider a slightly different type of model known as design 
type models. All the technical terms used in this area are connected with agricultural 
experiments because, originally, the whole area was developed for agricultural exper- 
imentation of crop yield, methods of planting, the effects of various types of factors 
on the yields, etc. The main technical terms are the following: A plot means an ex- 
perimental unit. If different methods of teaching are compared and the students are 
subjected to various methods of teaching, then the basic unit on which the experi- 
ment is carried out is a student. Then a student in this case is a plot. If the breaking 
strength of an alloy is under study and if 15 units of the same alloy are being tested, 
then each unit of the alloy is a plot. If the gain in weight of experimental animals is 
under study, then a plot here is an experimental animal. If experimental plots of land 
are there where tapioca is planted and the experiment is conducted to study the ef- 
fect of different types of fertilizers on the yield, then an experimental plot is a plot of 
land. The basic unit which is subjected to experimentation is called a plot. A group of 
such plots is called a block or block of plots. If a piece of land is divided into 10 plots 
and experimentation is done on these 10 plots, then this block contains 10 plots. If an- 
other piece of land is divided into 8 plots for experimentation, then that block contains 
8 plots. The item or factor under study in the experimentation is called a treatment. 
In the case of students being subjected to 3 different methods of teaching, there are 
3 treatments. In the case of 6 different fertilizers being studied with reference to yield 
of tapioca, then there are 6 treatments, etc. For a proper experimentation, all the plots 
must be homogeneous, within and between, as far as variations with respect to all fac- 
tors are concerned, which are not under study. For example, if 3 different methods of 
teaching are to be compared, then the students selected for this study must have the 
same background, same exposure to the subject matter, or must be the same as far as 
all other factors are concerned, which may have some relevance to the performance of
the students. Performance may be measured by the grades obtained in an
examination at the end of subjecting the student to a particular method of
teaching.

Hence planning of an experiment means to make sure that all experimental 
plots are fully homogeneous within and between with respect to all other factors, 
other than the factors under study, which may have some effect on the experimental 
outcome. If one variety of corn is planted on 10 different plots of land, where the 
plots have different natural soil fertility level, different water drainage, different ex- 
posure to sun, etc., then the plots are not homogeneous within or between them. 
If we are trying to study the effect of different fertilizers on the yield of corn, then 


Open Access. © 2018 Arak M. Mathai, Hans J. Haubold, published by De Gruyter. This work is licensed
under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
https://doi.org/10.1515/9783110562545-015




the pieces of land (experimental plots) taken must be of the same type and fully ho- 
mogeneous with respect to all factors of variation other than the effect of fertilizers 
used. 


15.2 Fully randomized experiments 


In this experiment, suppose that we have selected $n_1$ plots to try the first treatment,
$n_2$ plots to try the second treatment, ..., $n_k$ plots to try the $k$-th treatment; then all
$n_1 + \cdots + n_k = n_{.}$ plots must be fully homogeneous within and between and with respect
to all factors of variation. If it is an agricultural experiment involving the study of 5
different methods of planting of one type of rubber trees, then the plots of land selected
must be of the same shape and size, of the same basic fertility of the soil, of the same 
type of elevation, same type of drainage, same type of precipitation, etc., or in short, 
identical with respect to all known factors which may have some effect on the yield of 
rubber latex. It may not be possible to find such n plots of land at one place but we 
may get identical plots at different locations, if not at the same place. There should 
not be any effect of the location on the yield. Suppose we have $n_1$ plots at one place,
$n_2$ plots at another place, ..., $n_k$ plots at the $k$-th place but all the $n_{.} = n_1 + \cdots + n_k$ plots
are identical in every respect. Then take one of the methods at random and subject the
$n_1$ plots to this method, take another method at random, etc., or assign the methods
at random to the $k$ sets of plots. If it is an experiment involving 4 different methods of
teaching and if all $n_{.}$ homogeneous students can be found in one school, then divide
them into different groups according to convenience of the teachers and subject them
to these 4 different methods of teaching, $n_1 + \cdots + n_4 = n_{.}$. It is not necessary to divide
them into groups of different numbers. If equal numbers can be grouped, then it is well 
and good. In most of the situations, we may find 50 students in one school with the 
same background, 30 students with the same background within the group as well as 
between the groups in another school, etc., and thus the numbers in the groups may be 
different. The 50 students may be subjected to one method of teaching, the 30 another 
method of teaching, etc. Let $x_{ij}$ be the grade obtained by the $j$-th student ($j$-th plot)
under the $i$-th method of teaching ($i$-th treatment). Then the possibility is that $x_{ij}$ may
contain a general effect, call it $\mu$, and an effect due to the $i$-th method of teaching, call it $\alpha_i$.
The general effect can be given the following interpretation. Suppose that the student
is given a test without subjecting the student to any particular method of teaching. We
cannot expect the student to get a zero grade. There will be some general effect. Then
$\alpha_i$ can be interpreted as the deviation from the general effect due to the $i$-th treatment.
In the experimentation, we have only controlled all known factors of variation. But
there may still be unknown factors which may be contributing towards $x_{ij}$. The sum
total effect of all unknown factors is known as the random effect $e_{ij}$. Thus, $x_{ij}$ in this
fully randomized experiment is a function of $\mu$, $\alpha_i$ and $e_{ij}$. What is the functional form,
or in which way do these effects enter into $x_{ij}$? The simplest model that we can come up with




is a simple linear additive model or we assume that

$$x_{ij} = \mu + \alpha_i + e_{ij}, \quad j = 1, \ldots, n_i,\ i = 1, \ldots, k, \quad e_{ij} \sim N(0, \sigma^2). \tag{15.1}$$

That is, for $i = 1$,

$$\begin{aligned}
x_{11} &= \mu + \alpha_1 + e_{11}\\
x_{12} &= \mu + \alpha_1 + e_{12}\\
&\;\;\vdots\\
x_{1n_1} &= \mu + \alpha_1 + e_{1n_1}
\end{aligned}$$

and similar equations for $i = 2, \ldots, k$. If the $\alpha_i$'s are assumed to be some unknown
constants, then the model in (15.1) is called a simple additive fixed effect one-way
classification model. Here, one-way classification means that only one set of treatments is
studied: one set of methods of teaching, one set of fertilizers, one set of varieties
of corn, etc. There is a possibility that the $\alpha_i$'s could be random variables; then the model
will be a random effect model. We will start with a fixed effect model.

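The first step in the analysis of any model is to estimate the effects and then try
to test some hypotheses by putting some assumptions on the random part $e_{ij}$'s. For
estimating $\mu$, $\alpha_i$, $i = 1, \ldots, k$, we will use the method of least squares because we do not
have any assumption of any distribution on the $e_{ij}$'s or $x_{ij}$'s. The error sum of squares is
given by the following, where we can use calculus for minimization, observing that
the maximum is at $+\infty$ and hence the critical point will correspond to a minimum.

$$\sum_{ij} e_{ij}^2 = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij} - \mu - \alpha_i)^2 \tag{a}$$

$$\frac{\partial}{\partial\mu}\Big[\sum_{ij} e_{ij}^2\Big] = 0 \;\Rightarrow\; -2\sum_{ij}(x_{ij} - \hat{\mu} - \hat{\alpha}_i) = 0 \tag{b}$$

$$\Rightarrow\; x_{..} - n_{.}\hat{\mu} - \sum_{i=1}^{k} n_i\hat{\alpha}_i = 0 \;\Rightarrow\; \hat{\mu} = \frac{x_{..}}{n_{.}} - \frac{\sum_{i=1}^{k} n_i\hat{\alpha}_i}{n_{.}} \tag{c}$$

because when we sum up with respect to $j$ we get $n_i$ and then summation with respect
to $i$ is denoted by $n_{.}$. Also the standard notation $\sum_j x_{ij} = x_{i.}$, $\sum_i x_{ij} = x_{.j}$, when $j$ is free
of $i$, will be used, where the summation with respect to a subscript is denoted by a dot.
Hence $x_{..} = \sum_{ij} x_{ij}$. Differentiation of (a) with respect to $\alpha_i$ for a specific $i$ such as $\alpha_1$ will
yield the following:

$$\frac{\partial}{\partial\alpha_i}\Big[\sum_{ij} e_{ij}^2\Big] = 0 \;\Rightarrow\; \sum_{j=1}^{n_i}(x_{ij} - \hat{\mu} - \hat{\alpha}_i) = 0 \;\Rightarrow\; \hat{\alpha}_i = \frac{x_{i.}}{n_i} - \hat{\mu}. \tag{d}$$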



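Note that in (c), without loss of generality, $\sum_i n_i\alpha_i$ can be taken as zero because our
$\alpha_i$'s are defined as deviations from the general effect due to the $i$-th treatment; then
the sum of the deviations is zero, because for any set of numbers $y_1, \ldots, y_n$, $(y_1 - \bar{y}) +
\cdots + (y_n - \bar{y}) = 0$. Our analysis in equations (a) to (d) will go through even if we do not
wish to use this condition. The least square minimum, denoted by $s^2$, is available from
(a) by substituting the least square estimates $\hat{\mu}$ and $\hat{\alpha}_i$. That is,

$$s^2 = \sum_{ij}(x_{ij} - \hat{\mu} - \hat{\alpha}_i)^2 = \sum_{ij}\Big(x_{ij} - \hat{\mu} - \Big[\frac{x_{i.}}{n_i} - \hat{\mu}\Big]\Big)^2 = \sum_{ij}\Big(x_{ij} - \frac{x_{i.}}{n_i}\Big)^2 \tag{15.2}$$

$$= \sum_{ij}\Big(x_{ij} - \frac{x_{..}}{n_{.}}\Big)^2 - \sum_{i} n_i\Big(\frac{x_{i.}}{n_i} - \frac{x_{..}}{n_{.}}\Big)^2 \tag{15.3}$$

$$= \Big(\sum_{ij} x_{ij}^2 - \frac{x_{..}^2}{n_{.}}\Big) - \Big(\sum_{i}\frac{x_{i.}^2}{n_i} - \frac{x_{..}^2}{n_{.}}\Big). \tag{15.4}$$

This $s^2$ is called the residual sum of squares. All the different representations in (15.2)
to (15.4) will be made use of later. The derivations are left to the student. If we have a
hypothesis of the type $H_0: \alpha_1 = 0 = \alpha_2 = \cdots = \alpha_k$, then under this $H_0$ the model becomes
$x_{ij} = \mu + e_{ij}$ and if we proceed as before then the least square minimum, denoted by $s_0^2$,
is given by the following:

$$s_0^2 = \sum_{ij}\Big(x_{ij} - \frac{x_{..}}{n_{.}}\Big)^2 = \sum_{ij} x_{ij}^2 - \frac{x_{..}^2}{n_{.}}. \tag{15.5}$$

Then

$$s_0^2 - s^2 = \sum_{i} n_i\Big(\frac{x_{i.}}{n_i} - \frac{x_{..}}{n_{.}}\Big)^2 = \sum_{i}\frac{x_{i.}^2}{n_i} - \frac{x_{..}^2}{n_{.}}$$

can be called the sum of squares due to the hypothesis or the sum of squares due to
the $\alpha_i$'s. Thus we have the following identity:

$$\begin{aligned}
s_0^2 &= [s_0^2 - s^2] + [s^2]\\
&= \text{sum of squares due to the treatments} + \text{residual sum of squares}\\
&= \text{between treatment sum of squares} + \text{within treatment sum of squares}
\end{aligned}$$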


Definition 15.1 (Analysis of variance principle). The principle of splitting the to- 
tal variation in the data into the sum of variations due to different components is 
known as the analysis of the variance principle or the ANOVA principle. 




Since we are not dividing by the sample sizes and making it per unit variation 
or we are not taking sample variances, the principle is more appropriately called the 
analysis of variation principle. For a one-way classification model as in (15.1), there 
is only one component of variation, namely one set of treatments. More elaborate de- 
signs will have more components of variation. 

In order to test hypotheses of the type $H_0: \alpha_1 = \cdots = \alpha_k = 0$, which is the same as
saying $\mu_1 = \cdots = \mu_k$ where $\mu_i = \mu + \alpha_i$, we will make use of two results from
Chapter 10, namely Result 10.14 on the chi-squaredness of quadratic form and Result 10.15
on the independence of two quadratic forms in standard normal variables. The
chi-squaredness, in effect, says that if the $m \times 1$ vector $Y$ has an $m$-variate normal
distribution $Y \sim N_m(O, \sigma^2I_m)$, where $\sigma^2$ is a scalar quantity and $I_m$ is the identity matrix of
order $m$, then $\frac{1}{\sigma^2}Y'AY$, $A = A'$, is a chi-square with $\nu$ degrees of freedom if and only if
$A$ is idempotent and of rank $\nu$. Result 10.15 says that two such quadratic forms $Y'AY$
and $Y'BY$ are independently distributed if and only if $AB = O$, where $O$ is a null matrix.
We will make use of these two results throughout the whole discussion of Design of
Experiments and Analysis of Variance. Derivations of each item will take up too much
space. One item will be illustrated here and the rest of the derivations are left to the
student. For illustrative purposes, let us consider

$$s^2 = \sum_{ij}\Big(x_{ij} - \frac{x_{i.}}{n_i}\Big)^2 = \sum_{ij}\Big([\mu + \alpha_i + e_{ij}] - \Big[\mu + \alpha_i + \frac{e_{i.}}{n_i}\Big]\Big)^2 = \sum_{ij}\Big(e_{ij} - \frac{e_{i.}}{n_i}\Big)^2. \tag{15.6}$$

If we write $x_{(i)}$ and $e_{(i)}$ for the subvectors,

$$x_{(i)} = \begin{bmatrix} x_{i1}\\ x_{i2}\\ \vdots\\ x_{in_i}\end{bmatrix}, \quad
e_{(i)} = \begin{bmatrix} e_{i1}\\ e_{i2}\\ \vdots\\ e_{in_i}\end{bmatrix} \quad \text{then} \quad
\begin{bmatrix} e_{i1} - \frac{e_{i.}}{n_i}\\ \vdots\\ e_{in_i} - \frac{e_{i.}}{n_i}\end{bmatrix} \tag{15.7}$$

which can be written in matrix notation as $(I_{n_i} - B_i)e_{(i)}$ where $B_i = \frac{1}{n_i}J_{n_i}J_{n_i}'$ with $J_{n_i}' =
(1, 1, \ldots, 1)$. We note that $B_i = B_i^2$ and hence $B_i$ is idempotent. Further, $(I - B_i)^2 = (I - B_i)$
or $I - B_i$ is also idempotent. Then we can write

$$\sum_{ij}\Big(x_{ij} - \frac{x_{i.}}{n_i}\Big)^2 = e'(I - B)e$$

where $e' = (e_{11}, \ldots, e_{1n_1}, e_{21}, \ldots, e_{2n_2}, \ldots, e_{k1}, \ldots, e_{kn_k})$ and $B$ will be a block diagonal
matrix with the diagonal blocks being the matrices $B_1, \ldots, B_k$ where $B_i = \frac{1}{n_i}J_{n_i}J_{n_i}'$. Further,
we note that $I - B$ is idempotent and of rank $n_{.} - k$. Therefore, from Result 10.13,

$$\frac{s^2}{\sigma^2} \sim \chi^2_{n_{.}-k} \tag{15.8}$$




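that is, the least square minimum divided by $\sigma^2$ is a chi-square with $n_{.} - k$ degrees of
freedom. In a similar fashion, we can show that

$$s_0^2 = \sum_{ij}\Big(x_{ij} - \frac{x_{..}}{n_{.}}\Big)^2 \sim \sigma^2\chi^2_{n_{.}-1}, \quad \text{under } H_0$$

$$s_0^2 - s^2 = \sum_{i} n_i\Big(\frac{x_{i.}}{n_i} - \frac{x_{..}}{n_{.}}\Big)^2 \sim \sigma^2\chi^2_{k-1}, \quad \text{under } H_0 \tag{15.9}$$

and that $s^2$ and $(s_0^2 - s^2)$ are independently distributed. Here, the decomposition is of
the following form:

$$\chi^2_{n_{.}-1} \equiv \chi^2_{k-1} + \chi^2_{n_{.}-k}. \tag{15.10}$$

From (15.8), (15.9) and Result 10.15, it follows that

$$\frac{(s_0^2 - s^2)/(k-1)}{s^2/(n_{.}-k)} \sim F_{k-1,\,n_{.}-k}, \quad \text{under } H_0. \tag{15.11}$$

[If $H_0$ is not true then the left side of (15.11) will be a non-central F with the numerator
chi-square being non-central.] The test criterion will be to reject for large values of this
F-statistic or reject $H_0$, at the level $\alpha$, if the observed value of $F_{k-1,\,n_{.}-k} \geq F_{k-1,\,n_{.}-k,\,\alpha}$.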


15.2.1 One-way classification model as a general linear model 


The model in (15.1), which is a linear fixed effect one-way classification model, can
be put in matrix notation as a general linear model of the type $Y = X\beta + e$ of
Chapter 14. Here, $Y$ is the $n_{.} \times 1 = (n_1 + \cdots + n_k) \times 1$ vector of the observations $x_{ij}$'s or $Y' =
(x_{11}, \ldots, x_{1n_1}, x_{21}, \ldots, x_{2n_2}, \ldots, x_{k1}, \ldots, x_{kn_k})$, and $e$ is the corresponding vector of $e_{ij}$'s. $\beta$ is the
$(k+1) \times 1$ vector of parameters or $\beta' = (\mu, \alpha_1, \ldots, \alpha_k)$. Here, $X$ is the design matrix. It is an
$n_{.} \times (k+1)$ or $(n_1 + \cdots + n_k) \times (k+1)$ matrix with the first column all ones, the second
column all ones for the first $n_1$ rows only, the third column all ones from the $(n_1+1)$-th
row to the $(n_1 + n_2)$-th row, and so on. In other words, the sum of the second to $(k+1)$-th
columns is equal to the first column. If we delete the first column, then all the remaining
columns are linearly independent and thus the column rank of $X$ is $k$. But $n_{.} \geq (k+1)$,
and hence the rank of the design matrix in this case is $k$, and thus $X$ is a less than full
rank matrix and therefore $X'X$ is singular. If we use the notations of (15.7) then the
model in (15.1) can be written as follows, as a general linear model $Y = X\beta + e$, where

$$Y = \begin{bmatrix} x_{(1)}\\ x_{(2)}\\ \vdots\\ x_{(k)}\end{bmatrix}, \quad
\beta = \begin{bmatrix} \mu\\ \alpha_1\\ \vdots\\ \alpha_k\end{bmatrix}, \quad
e = \begin{bmatrix} e_{(1)}\\ e_{(2)}\\ \vdots\\ e_{(k)}\end{bmatrix}, \quad
X = \begin{bmatrix}
J_{n_1} & J_{n_1} & O & \cdots & O\\
J_{n_2} & O & J_{n_2} & \cdots & O\\
\vdots & \vdots & \vdots & & \vdots\\
J_{n_k} & O & O & \cdots & J_{n_k}
\end{bmatrix} \tag{15.12}$$

where $J_m$ is an $m \times 1$ column vector of ones, $m = n_1, n_2, \ldots, n_k$, and $O$ denotes a null
matrix. In this general linear model if we wish to estimate the parameter vector, then
the minimization of the error sum of squares $e'e$ leads to the normal equation:

$$X'X\hat{\beta} = X'Y \;\Rightarrow\; \hat{\beta} = (X'X)^{-}X'Y. \tag{15.13}$$


In the light of the discussion in (15.12), note that (15.13) is a singular system of normal
equations. Hence there is no unique solution since $(X'X)^{-1}$ does not exist. A solution
can be written in terms of a g-inverse $(X'X)^{-}$ of $X'X$ as indicated in (15.13). Thus, if
we use matrix methods in analyzing a one-way classification model, or other design
models, then the procedure will be complicated. It will be seen that the procedure
adopted in Section 15.2 of separating the sum of squares is the simplest method, and
we will be using the same type of procedures in other design models also.
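The rank deficiency of the design matrix can be seen on a small case. The following plain-Python sketch is not part of the original text; the function names `one_way_design` and `rank` and the group sizes are assumptions made for illustration. It builds $X$ as in (15.12) for $k = 3$ treatments and computes the ranks of $X$ and $X'X$ by Gaussian elimination.

```python
def one_way_design(sizes):
    """Design matrix: a column of ones, then one indicator column per treatment."""
    k = len(sizes)
    rows = []
    for i, n_i in enumerate(sizes):
        for _ in range(n_i):
            rows.append([1.0] + [1.0 if j == i else 0.0 for j in range(k)])
    return rows

def rank(matrix, eps=1e-10):
    """Rank by Gaussian elimination with partial pivoting."""
    m = [list(r) for r in matrix]
    r = 0
    for col in range(len(m[0])):
        pivot = max(range(r, len(m)), key=lambda i: abs(m[i][col]))
        if abs(m[pivot][col]) < eps:
            continue                      # no pivot in this column
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            f = m[i][col] / m[r][col]
            m[i] = [vi - f * vr for vi, vr in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r

sizes = [2, 3, 2]                         # hypothetical n_1, n_2, n_3
X = one_way_design(sizes)
cols = len(X[0])                          # k + 1 = 4 columns
XtX = [[sum(row[i] * row[j] for row in X) for j in range(cols)] for i in range(cols)]
rank_X, rank_XtX = rank(X), rank(XtX)
print(f"columns = {cols}, rank of X = {rank_X}, rank of X'X = {rank_XtX}")
```

With $k + 1 = 4$ columns but rank $k = 3$, the $4 \times 4$ matrix $X'X$ has no regular inverse, which is why (15.13) has to be written with a g-inverse.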


15.2.2 Analysis of variance table or ANOVA table 


In data analysis connected with Design of Experiments, usually the final analysis is 
put in a nice tabular format, known as the Analysis of Variance Table or the ANOVA 
Table. For a one-way classification linear fixed effect model, which is applicable in a 
completely randomized experiment, the following is the format of the ANOVA table, 
where d.f = degrees of freedom, S.S = sum of squares, M.S = mean squares. 


ANOVA table for a one-way classification

Variation due to     d.f           S.S                                   M.S             F-ratio
(1)                  (2)           (3)                                   (4) = (3)/(2)   (5)

Between treatments   $k-1$         $\sum_i \frac{x_{i.}^2}{n_i}$ - C.F   $T$             $T/E \sim F_{k-1,\,n_{.}-k}$
Within treatments    $n_{.}-k$     (subtract)                            $E$
Total                $n_{.}-1$     $\sum_{ij} x_{ij}^2$ - C.F

where C.F stands for "correction factor", which is C.F $= x_{..}^2/n_{.}$. In this ANOVA table,
there is a (6)-th column called "Inference". Due to lack of space, this column is not
listed above. In this column, write "significant" or "not significant" as the case may be.
Here, "significant" means that the observed F-value is significantly high or we reject
the null hypothesis of the effects being zero, or in the one-way case the hypothesis
is that the $\alpha_i$'s are zeros or the treatment effects are zeros and this hypothesis is rejected.
Otherwise write "not significant". The residual sum of squares can be obtained by
subtraction of the sum of squares due to treatments, namely, $s_0^2 - s^2 = \sum_i \frac{x_{i.}^2}{n_i}$ - C.F, from
the total sum of squares $\sum_{ij} x_{ij}^2$ - C.F. Similarly, the degrees of freedom corresponding
to the residual sum of squares $s^2$ is available from the total degrees of freedom, namely
$n_{.} - 1$, minus the degrees of freedom for the treatment sum of squares, namely $k - 1$,
which gives $(n_{.} - 1) - (k - 1) = n_{.} - k$.


Example 15.1. The following table gives the yield of wheat per test plot under three
different fertilizers. These fertilizers are denoted by A, B, C.

Yield of wheat under fertilizers A, B, C

                                             Total
A   50  60  60  65  70  80  75  80  85  75    700
B   60  60  65  70  75  80  70  75  85  80    720
C   40  50  50  60  60  60  65  75  70  70    600

Assume that a one-way classification fixed effect model is appropriate. Test the
hypothesis, at a 5% level of rejection, that the fertilizer effects are the same, assuming
$e_{ij} \sim N(0, \sigma^2)$ and mutually independently distributed.


Solution 15.1. Let $x_{ij}$ be the j-th observation under the i-th fertilizer, i = 1, 2, 3. Here
all the sample sizes are equal to 10, and hence j = 1, ..., 10 and n = 30.

$$\sum_{ij} x_{ij} = x_{..} = 600 + 720 + 700 = 2020;$$

$$\text{C.F} = \frac{x_{..}^2}{n} = \frac{(2020)^2}{30} = 136\,013.33;$$

$$\sum_i \frac{x_{i.}^2}{n_i} = \frac{1}{10}[600^2 + 720^2 + 700^2] = 136\,840;$$

$$\sum_{ij} x_{ij}^2 - \text{C.F} = 50^2 + \cdots + 70^2 - \text{C.F} = 3636.67;$$

$$\sum_i \frac{x_{i.}^2}{n_i} - \text{C.F} = 826.67.$$


Now, we can set up the ANOVA table.

Variation due to       d.f           S.S        M.S       F-ratio
Between fertilizers    k - 1 = 2     826.67     413.33    413.33/104.07 = 3.97 > 3.35
Within fertilizers     n - k = 27    2810.00    104.07
Total                  n - 1 = 29    3636.67

The tabulated point is $F_{2,27,0.05} = 3.35$ and our observed $F_{2,27} = 3.97 > 3.35$, and
hence we reject the hypothesis that the effects of the fertilizers are equal. In the column
on inference, which is not listed above, we write "significant", that is, the F-value is
significantly high.
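
The arithmetic of Solution 15.1 can be retraced with a short script. This is only a sketch: the function name `one_way_anova` and the dictionary keys are illustrative choices, not notation from the text, and the tabulated F point still has to be looked up separately.

```python
# One-way ANOVA through the correction-factor formulas, applied to Example 15.1.
# The function name and the dictionary keys are illustrative assumptions.

def one_way_anova(groups):
    """Return the ANOVA quantities for a list of samples, one list per treatment."""
    k = len(groups)                                # number of treatments
    n = sum(len(g) for g in groups)                # total number of observations
    grand_total = sum(sum(g) for g in groups)      # x..
    cf = grand_total ** 2 / n                      # correction factor C.F
    ss_between = sum(sum(g) ** 2 / len(g) for g in groups) - cf
    ss_total = sum(x ** 2 for g in groups for x in g) - cf
    ss_within = ss_total - ss_between              # obtained by subtraction
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return {
        "df_between": k - 1, "df_within": n - k,
        "ss_between": ss_between, "ss_within": ss_within, "ss_total": ss_total,
        "ms_between": ms_between, "ms_within": ms_within,
        "f_ratio": ms_between / ms_within,
    }

yields = [
    [50, 60, 60, 65, 70, 80, 75, 80, 85, 75],   # fertilizer A
    [60, 60, 65, 70, 75, 80, 70, 75, 85, 80],   # fertilizer B
    [40, 50, 50, 60, 60, 60, 65, 75, 70, 70],   # fertilizer C
]
table = one_way_anova(yields)
for name, value in table.items():
    print(f"{name}: {value:.2f}")
```

The within-treatment sum of squares is deliberately obtained by subtraction, mirroring the "(subtract)" entry of the ANOVA table rather than summing squared deviations within each group.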


15.2.3 Analysis of individual differences 


If the hypothesis of no effect of the treatments is rejected, that is, if the observed
F-value is significantly high, then this high value may be contributed by some of the
differences $\alpha_i - \alpha_j$, $i \ne j$, being not equal to zero. If the hypothesis of no
effect is not rejected, then we stop the analysis here and do not proceed further. We proceed
further only when the hypothesis is rejected, that is, when the F-value is found to be
significantly high. Individual hypotheses of the type $H_0: \alpha_i - \alpha_j = 0$ for
$i \ne j$ can be tested by using a Student-t test when we assume that
$e_{ij} \sim N(0, \sigma^2)$ for all i and j and mutually independently distributed. Note that
the least square estimate is

$$\hat{\alpha}_i - \hat{\alpha}_j = \frac{x_{i.}}{n_i} - \frac{x_{j.}}{n_j}$$

with variance

$$\operatorname{Var}(\hat{\alpha}_i - \hat{\alpha}_j) = \sigma^2\left(\frac{1}{n_i} + \frac{1}{n_j}\right)$$

and under the hypothesis $H_0: \alpha_i - \alpha_j = \delta$ (given)

$$\frac{\hat{\alpha}_i - \hat{\alpha}_j - \delta}{\hat{\sigma}\sqrt{\frac{1}{n_i} + \frac{1}{n_j}}} \sim t_{n-k}$$

where

$$\hat{\sigma}^2 = \frac{\text{Least square minimum}}{n-k} = \frac{s^2}{n-k}.$$

Hence the test criterion is the following: Reject $H_0: \alpha_i - \alpha_j = \delta$ (given), if
the observed value of

$$\left|\frac{\frac{x_{i.}}{n_i} - \frac{x_{j.}}{n_j} - \delta}{\hat{\sigma}\sqrt{\frac{1}{n_i} + \frac{1}{n_j}}}\right| \ge t_{n-k,\alpha/2}$$

where $\Pr\{t_{n-k} \ge t_{n-k,\alpha/2}\} = \alpha/2$. A $100(1-\alpha)\%$ confidence interval for
$\alpha_i - \alpha_j$ is then

$$\left(\frac{x_{i.}}{n_i} - \frac{x_{j.}}{n_j}\right) \pm t_{n-k,\alpha/2}\,\hat{\sigma}\sqrt{\frac{1}{n_i} + \frac{1}{n_j}}.$$


Note 15.1. In order to see for which i and j the difference $\alpha_i - \alpha_j$ is
contributing towards the significant sum of squares due to the $\alpha_i$'s, we should
consider all the differences $\hat{\alpha}_i - \hat{\alpha}_j$. A practical procedure is the
following: Take the largest absolute difference $|x_{i.}/n_i - x_{j.}/n_j|$, then the next
largest difference, and so on, and test the corresponding hypotheses
$\alpha_i - \alpha_j = 0$ until a difference is found to be not significant, and then stop. If
$\hat{\alpha}_i - \hat{\alpha}_j$ is found to be significant, that is, the hypothesis
$\alpha_i - \alpha_j = 0$ is rejected, then an estimate of $\alpha_i - \alpha_j$ is
$x_{i.}/n_i - x_{j.}/n_j$. By using the same procedure one can test hypotheses and construct
confidence intervals on linear functions $c_1\alpha_1 + \cdots + c_k\alpha_k$ for specific
$c_1, \ldots, c_k$. Take the $c_i$'s such that $c_1 + \cdots + c_k = 0$ so that the
contribution from μ, the general effect, is canceled. Such linear functions are often called
contrasts.


Example 15.2. In Example 15.1, test the hypotheses on individual differences, that is,
hypotheses of the type $H_0: \alpha_i - \alpha_j = 0$, and see which differences are
contributing towards the significant contribution due to the $\alpha_i$'s. Test at a 5% level
of rejection.


Solution 15.2. The individual differences to be considered are $\alpha_1 - \alpha_2$,
$\alpha_1 - \alpha_3$, $\alpha_2 - \alpha_3$. Consider the following computations:

$$\hat{\sigma}^2 = \frac{s^2}{n-k} = \frac{2810}{27}; \quad \sqrt{\frac{1}{n_i} + \frac{1}{n_j}} = \sqrt{\frac{1}{5}};$$

$$\frac{x_{1.}}{n_1} = \frac{700}{10} = 70; \quad \frac{x_{2.}}{n_2} = \frac{720}{10} = 72; \quad \frac{x_{3.}}{n_3} = \frac{600}{10} = 60;$$

$$t_{27,0.025} = 2.052; \quad \hat{\sigma}\sqrt{\frac{1}{n_i} + \frac{1}{n_j}} = 4.56.$$

The largest absolute difference between estimates is
$\hat{\alpha}_2 - \hat{\alpha}_3 = 72 - 60 = 12$. Hence $12/4.56 = 2.63 > 2.052$. This
hypothesis is rejected. Next, $\hat{\alpha}_1 - \hat{\alpha}_3 = 10$ and
$10/4.56 = 2.19 > 2.052$. This is also rejected, but $H_0: \alpha_1 - \alpha_2 = 0$ is not
rejected. Hence the differences $\alpha_2 - \alpha_3$ and $\alpha_1 - \alpha_3$
are contributing significantly towards the treatment sum of squares.
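
The pairwise comparisons of Solution 15.2 can be sketched as follows. The group means, the residual sum of squares 2810 on 27 degrees of freedom, and the tabulated point 2.052 are taken from the text; the variable and function names are illustrative assumptions. The critical value is entered by hand so that the sketch needs only the standard library.

```python
from itertools import combinations
from math import sqrt

# Quantities from Examples 15.1 and 15.2; names are illustrative.
means = {"A": 700 / 10, "B": 720 / 10, "C": 600 / 10}   # x_i. / n_i
sizes = {"A": 10, "B": 10, "C": 10}                     # n_i
residual_ss, residual_df = 2810.0, 27                   # s^2 and n - k
t_crit = 2.052                                          # tabulated t_{27, 0.025}

sigma2_hat = residual_ss / residual_df                  # estimate of sigma^2

def compare(i, j):
    """Return the estimated difference, the t statistic and a 95% interval."""
    diff = means[i] - means[j]
    se = sqrt(sigma2_hat * (1 / sizes[i] + 1 / sizes[j]))
    return diff, diff / se, (diff - t_crit * se, diff + t_crit * se)

results = {pair: compare(*pair) for pair in combinations(means, 2)}

# Work through the differences from the largest absolute value downwards.
for (i, j), (diff, t_obs, ci) in sorted(results.items(), key=lambda kv: -abs(kv[1][0])):
    verdict = "rejected" if abs(t_obs) >= t_crit else "not rejected"
    print(f"{i} - {j}: diff = {diff:.0f}, |t| = {abs(t_obs):.2f}, H0 {verdict}, "
          f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

The loop prints every pair for inspection; the stopping rule of Note 15.1 would end the procedure at the first difference that is not significant.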


Exercises 15.2 


15.2.1. Under the assumption that the errors in the one-way classification fixed effect
model, $e_{ij} \sim N(0, \sigma^2)$, are mutually independently distributed, prove, by
examining the corresponding quadratic forms or otherwise, that

$$s_0^2 \sim \sigma^2\chi^2_{n-1} \text{ under } H_0 \qquad \text{(i)}$$

$$s_0^2 - s^2 \sim \sigma^2\chi^2_{k-1} \text{ under } H_0 \qquad \text{(ii)}$$

and that $s^2$ and $s_0^2 - s^2$ are independently distributed.


15.2.2. Show that in Exercise 15.2.1, for $s^2$ to be $\sigma^2\chi^2_{n-k}$ the null
hypothesis need not hold, that is, for the chi-squaredness of the least square minimum divided
by $\sigma^2$ no hypotheses need to hold, and that $s_0^2 - s^2$, divided by $\sigma^2$, will be
a non-central chi-square in general and a central chi-square when $H_0$ holds.


15.2.3. Write the model in (15.1) as a general linear model of the form $Y = X\beta + e$ and
show that the rank of the design matrix X is k, so that X'X is singular.


15.2.4. Set up the ANOVA table if the group sizes are equal to m and if there are k 
groups in a completely randomized experiment. 


15.2.5. Analyze fully the following one-way classification fixed effect data, which
means to test the hypothesis that the treatment effects are equal and, if this hypothesis
is rejected, then test for individual differences and set up confidence intervals
for the individual differences. Test at a 5% level of rejection and set up 95% confidence
intervals. Data:


Group 1   10  12  15  10  25  13
Group 2   20  25  32  33  28  34  30  22
Group 3    5   8   2   4   6   8   4


15.3 Randomized block design and two-way classifications 


If it is an agricultural experiment for checking the effectiveness of 10 different fertilizers
on the yield of sunflower seeds, then it may be very difficult to get
$n_1 + \cdots + n_{10} = n$ identical experimental plots. A few plots may be available in one
locality where all
the plots are homogeneous within the plots and between the plots. There may be a 
few other plots available in a second locality but between these two localities there 
may be differences due to the difference in the fertility of the soil in these two local- 
ities. Then we have two blocks of plots which are fully homogeneous within each 
block but there may be differences between blocks. Then there will be the treat- 
ment effect due to the fertilizers and a block effect due to the different blocking 
of experimental plots. If it is an experiment involving method of teaching, then it 
is usually difficult to come up with a large number of students having exactly the 
same backgrounds and intellectual capacities. In one school, the students of the 
same class may have the same background but if we take students from two differ- 
ent schools, then there may be differences in their backgrounds. Here, the schools 
will act as blocks. In the above two cases, we have two different types of effects, one 
due to the treatments and the other due to the blocks. If m blocks of n plots each 
are taken, where the plots are homogeneous within each block and if the n treat- 
ments are assigned at random to these n plots in each block then such an experiment 
is called a randomized block experiment. If $x_{ij}$ is the observation on the j-th treatment
from the i-th block, then we may write the simplest model in the following format:

$$x_{ij} = \mu + \alpha_i + \beta_j + e_{ij}, \quad i = 1, \ldots, m,\ j = 1, \ldots, n,\ e_{ij} \sim N(0, \sigma^2) \tag{15.14}$$

where for i = 1 we have

$$x_{11} = \mu + \alpha_1 + \beta_1 + e_{11}$$

$$x_{12} = \mu + \alpha_1 + \beta_2 + e_{12}$$

$$\vdots$$

$$x_{1n} = \mu + \alpha_1 + \beta_n + e_{1n}$$


and similar equations for i = 2, ..., m, where μ is a general effect, $\alpha_i$ is the
deviation from the general effect due to the i-th block and $\beta_j$ is the deviation from the
general effect due to the j-th treatment. The model in (15.14) is called the linear, fixed
effect, two-way classification model without interaction, with one observation per cell. Here,
"fixed effect" means that the $\alpha_i$'s and $\beta_j$'s are taken as fixed unknown quantities
and not as random quantities. The word "interaction" will be explained with a simple
example. In a drug testing experiment, suppose that 5 different drugs are tried on 4
different age groups of patients. Here, the age groups are acting as blocks and the drugs
as treatments. The effect of the drug may be different with different age groups. In other
words, if $\beta_j$ is the effect of the j-th drug, then $\beta_j$ may vary with the age group.
There is a possibility of an effect due to the combination of the i-th block and j-th
treatment, something like $\gamma_{ij}$, an effect depending on i and j. If a joint effect is
possible then the model will change to the following:

$$x_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij} + e_{ij}, \quad e_{ij} \sim N(0, \sigma^2). \tag{15.15}$$

Since both $\gamma_{ij}$ and $e_{ij}$ carry both the subscripts i and j, and since all quantities
on the right are unknown, there is no way of estimating the joint effect $\gamma_{ij}$ because
it cannot be separated from $e_{ij}$. Such joint effects are called interactions. If the
interaction is to be estimated, then the experiment has to be repeated a number of times, say,
r times. In this case, we say that the experiment is replicated r times. Then the k-th
observation in the i-th block corresponding to the j-th treatment can be denoted by
$x_{ijk}$ and the model can be written as

$$x_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + e_{ijk}, \quad i = 1, \ldots, m,\ j = 1, \ldots, n,\ k = 1, \ldots, r,\ e_{ijk} \sim N(0, \sigma^2). \tag{15.16}$$

In this case, we can estimate $\gamma_{ij}$ and test hypotheses on the interaction
$\gamma_{ij}$ also. The model in (15.16) is called the two-way classification model with
interaction and (15.14) is the two-way classification model without interaction. A randomized
block experiment is conducted in such a way that there is no possibility of interaction
between blocks and treatments so that the model in (15.14) is appropriate. When there is a
possibility of interaction, then the experiment has to be replicated so that one can use the
model in (15.16) and complete the analysis. First, we will start with the model in (15.14).


15.3.1 Two-way classification model without interaction 


Consider a randomized block experiment where there is no possibility of interaction 
between blocks and treatments so that one can use the model in (15.14), which is an 
additive, fixed effect, two-way classification model without interaction. Suppose that we 
have one observation per cell or one observation corresponding to each combination 
of i and j. For estimating the parameters, we will use the method of least squares. That
is, we consider the error sum of squares

$$\sum_{ij} e_{ij}^2 = \sum_{ij}(x_{ij} - \mu - \alpha_i - \beta_j)^2,$$

differentiate with respect to μ, $\alpha_i$, $\beta_j$, equate to zero and solve. [This part is
left as an exercise to the student.] We get the estimates as follows:

$$\hat{\mu} = \frac{x_{..}}{mn}, \quad \hat{\alpha}_i = \frac{x_{i.}}{n} - \hat{\mu}, \quad \hat{\beta}_j = \frac{x_{.j}}{m} - \hat{\mu}. \tag{15.17}$$
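
As a sketch of the step left to the student, the normal equations obtained by differentiating the error sum of squares can be written out as below. The side conditions $\alpha_. = 0$ and $\beta_. = 0$ used in the last line are the usual identifiability constraints for this model; they are stated here as an assumption of the sketch.

```latex
\begin{align*}
\frac{\partial}{\partial\mu}:\quad
  & \sum_{ij}(x_{ij}-\mu-\alpha_i-\beta_j)=0
  \;\Rightarrow\; x_{..}-mn\,\mu-n\,\alpha_{.}-m\,\beta_{.}=0,\\
\frac{\partial}{\partial\alpha_i}:\quad
  & \sum_{j}(x_{ij}-\mu-\alpha_i-\beta_j)=0
  \;\Rightarrow\; x_{i.}-n\,\mu-n\,\alpha_i-\beta_{.}=0,\\
\frac{\partial}{\partial\beta_j}:\quad
  & \sum_{i}(x_{ij}-\mu-\alpha_i-\beta_j)=0
  \;\Rightarrow\; x_{.j}-m\,\mu-\alpha_{.}-m\,\beta_j=0,\\
\text{with } \alpha_{.}=0=\beta_{.}:\quad
  & \hat{\mu}=\frac{x_{..}}{mn},\qquad
    \hat{\alpha}_i=\frac{x_{i.}}{n}-\hat{\mu},\qquad
    \hat{\beta}_j=\frac{x_{.j}}{m}-\hat{\mu}.
\end{align*}
```

Without the side conditions the m + n + 1 normal equations are linearly dependent, which is why some such constraint is needed to obtain a unique solution.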


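Since we have defined $\alpha_i$ as the deviation from the general effect due to the i-th
block, without loss of generality, we can take $\alpha_. = \alpha_1 + \cdots + \alpha_m = 0$,
and similarly $\beta_. = 0$, so that $\hat{\mu} = x_{..}/(mn)$. The least square minimum,
denoted by $s^2$, is given by the following, where $\text{C.F} = x_{..}^2/(mn)$:

$$s^2 = \sum_{ij}(x_{ij} - \hat{\mu} - \hat{\alpha}_i - \hat{\beta}_j)^2$$

$$= \sum_{ij}\left(x_{ij} - \frac{x_{..}}{mn}\right)^2 - \sum_{ij}\left(\frac{x_{i.}}{n} - \frac{x_{..}}{mn}\right)^2 - \sum_{ij}\left(\frac{x_{.j}}{m} - \frac{x_{..}}{mn}\right)^2$$

$$= \left[\sum_{ij} x_{ij}^2 - \text{C.F}\right] - \left[\sum_i \frac{x_{i.}^2}{n} - \text{C.F}\right] - \left[\sum_j \frac{x_{.j}^2}{m} - \text{C.F}\right]. \tag{15.18}$$

The simplifications in (15.18) are given as exercises to the student. If we put the
hypothesis that $\alpha_1 = 0 = \cdots = \alpha_m$, then the least square minimum, denoted by
$s_0^2$, will be the same $s^2$ as in (15.18), excluding the term

$$\sum_{ij}\left(\frac{x_{i.}}{n} - \frac{x_{..}}{mn}\right)^2 = \sum_i \frac{x_{i.}^2}{n} - \text{C.F}. \tag{15.19}$$

Hence the sum of squares due to the $\alpha_i$'s is given by (15.19). Similarly, the sum of
squares due to the $\beta_j$'s is given by

$$\sum_{ij}\left(\frac{x_{.j}}{m} - \frac{x_{..}}{mn}\right)^2 = \sum_j \frac{x_{.j}^2}{m} - \text{C.F}. \tag{15.20}$$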


If we assume $e_{ij} \sim N(0, \sigma^2)$ for all i and j and mutually independently
distributed, then we can establish the following results by examining the corresponding
quadratic forms:

Sum of squares due to the $\alpha_i$'s

$$= \sum_i \frac{x_{i.}^2}{n} - \text{C.F} \sim \sigma^2\chi^2_{m-1}(\lambda_1), \quad \lambda_1 = \text{non-centrality parameter}$$

$$\sim \sigma^2\chi^2_{m-1} \text{ when } \alpha_1 = 0 = \cdots = \alpha_m$$

Sum of squares due to the $\beta_j$'s

$$= \sum_j \frac{x_{.j}^2}{m} - \text{C.F} \sim \sigma^2\chi^2_{n-1}(\lambda_2), \quad \lambda_2 = \text{non-centrality parameter} \tag{15.21}$$

$$\sim \sigma^2\chi^2_{n-1} \text{ when } \beta_1 = 0 = \cdots = \beta_n$$

$$\text{Total sum of squares} = \sum_{ij} x_{ij}^2 - \text{C.F} \sim \sigma^2\chi^2_{mn-1}$$

$$\text{Least square minimum} = s^2 \sim \sigma^2\chi^2_{(m-1)(n-1)}.$$

Further, it can be shown that the least square minimum or the residual sum of squares
$s^2$ and the sum of squares due to the $\alpha_i$'s are independently distributed. Similarly,
$s^2$ and the sum of squares due to the $\beta_j$'s are independently distributed. But the sum
of squares due to the $\alpha_i$'s and the sum of squares due to the $\beta_j$'s are not
independently distributed.


Thus the total sum of squares (S.S) can be split into the sum of squares due to the
$\alpha_i$'s, the sum of squares due to the $\beta_j$'s and the residual sum of squares. That
is,

Total S.S = S.S due to the $\alpha_i$'s + S.S due to the $\beta_j$'s + residual S.S

$$\sum_{ij}\left(x_{ij} - \frac{x_{..}}{mn}\right)^2 = \sum_{ij}\left(\frac{x_{i.}}{n} - \frac{x_{..}}{mn}\right)^2 + \sum_{ij}\left(\frac{x_{.j}}{m} - \frac{x_{..}}{mn}\right)^2 + \sum_{ij}\left(x_{ij} - \frac{x_{i.}}{n} - \frac{x_{.j}}{m} + \frac{x_{..}}{mn}\right)^2. \tag{15.22}$$

Then, under the hypothesis $\alpha_1 = 0 = \cdots = \alpha_m$, we have

$$\frac{(\text{S.S due to } \alpha_i\text{'s})/(m-1)}{s^2/[(m-1)(n-1)]} \sim F_{m-1,(m-1)(n-1)} \tag{15.23}$$

and under the hypothesis $\beta_1 = 0 = \cdots = \beta_n$, we have

$$\frac{(\text{S.S due to } \beta_j\text{'s})/(n-1)}{s^2/[(m-1)(n-1)]} \sim F_{n-1,(m-1)(n-1)}. \tag{15.24}$$

We can use (15.23) and (15.24) to test the hypotheses on the $\alpha_i$'s and $\beta_j$'s,
respectively. The degrees of freedom for the residual sum of squares is obtained by the
formula:

mn - 1 - (m - 1) - (n - 1) = mn - m - n + 1 = (m - 1)(n - 1).


The residual sum of squares can also be obtained in a similar fashion as the total sum
of squares minus the sum of squares due to the $\alpha_i$'s minus the sum of squares due to the
$\beta_j$'s. Then the analysis of variance table or ANOVA table for a two-way classification
with one observation per cell can be set up as follows, where d.f = degrees of freedom, S.S =
sum of squares, M.S = mean square, C.F = correction factor = $x_{..}^2/(mn)$, and
ν = (m - 1)(n - 1):


ANOVA table for a randomized block experiment

Variation due to   d.f      S.S                                  M.S             F-ratio
(1)                (2)      (3)                                  (4) = (3)/(2)

Blocks             m - 1    $\sum_i x_{i.}^2/n - \text{C.F}$     A               $A/C = F_{m-1,\nu}$
Treatments         n - 1    $\sum_j x_{.j}^2/m - \text{C.F}$     B               $B/C = F_{n-1,\nu}$
Residual           ν        (obtained by subtraction)            C
Total              mn - 1   $\sum_{ij} x_{ij}^2 - \text{C.F}$


The last column in the ANOVA table is "Inference", which is not shown in the
above table due to lack of space. In the column on "Inference" write "significant" or
"not significant". Here, "significant" means the hypothesis of no effect of the
corresponding treatments is rejected, that is, the contribution corresponding to the effects
is significantly high compared to the residual sum of squares, or the F-value is above the
critical point. The inference on the significance of the block sum of squares is similar.

As in the one-way classification case, we can test for individual differences among the
$\alpha_i$'s as well as individual differences among the $\beta_j$'s. This should be done only
when the corresponding hypothesis is rejected, that is, when the corresponding effect is found
to be significantly high. If the block effect is found to be significantly high, that is, if
the hypothesis of no effect of the blocks is rejected, then test individual hypotheses of the
type

$$H_0: \alpha_i - \alpha_j = 0$$

by using the fact that the estimates of the effects $\alpha_i$ and $\alpha_j$ are linear
functions of the $e_{ij}$'s and, therefore, normally distributed under the normality assumption
for the $e_{ij}$'s. Hence we can use a Student-t test since the population variance
$\sigma^2$ is unknown. Note that

$$\hat{\alpha}_i - \hat{\alpha}_j = \frac{x_{i.}}{n} - \frac{x_{j.}}{n}$$

and hence under the hypothesis $H_0: \alpha_i - \alpha_j = 0$

$$\frac{(\hat{\alpha}_i - \hat{\alpha}_j) - 0}{\hat{\sigma}\sqrt{\frac{2}{n}}} = \frac{\frac{x_{i.}}{n} - \frac{x_{j.}}{n}}{\hat{\sigma}\sqrt{\frac{2}{n}}} \sim t_{(m-1)(n-1)}.$$

Hence the criterion will be to reject the hypothesis $H_0: \alpha_i - \alpha_j = 0$ if the
observed value of

$$\left|\frac{\frac{x_{i.}}{n} - \frac{x_{j.}}{n} - 0}{\hat{\sigma}\sqrt{\frac{2}{n}}}\right| \ge t_{(m-1)(n-1),\alpha/2} \tag{15.25}$$

where

$$\hat{\sigma}^2 = \frac{\text{Least square minimum}}{(m-1)(n-1)} = \frac{s^2}{(m-1)(n-1)} \tag{15.26}$$

and

$$\Pr\{t_{(m-1)(n-1)} \ge t_{(m-1)(n-1),\alpha/2}\} = \frac{\alpha}{2}.$$

Start with the biggest absolute difference of $\hat{\alpha}_i - \hat{\alpha}_j$ and continue
until the hypothesis is not rejected. Until that stage, all the differences are contributing
towards the significant contribution due to the $\alpha_i$'s. A similar procedure can be
adopted for testing individual hypotheses on $\beta_i - \beta_j$. This is to be done only when
the original hypothesis $\beta_1 = 0 = \cdots = \beta_n$ is rejected. Construction of
confidence intervals can also be done as in the case of one-way classification.
$100(1-\alpha)\%$ confidence intervals for $\alpha_i - \alpha_j$ and for $\beta_i - \beta_j$
are, respectively, the following:

$$\left(\frac{x_{i.}}{n} - \frac{x_{j.}}{n}\right) \pm t_{(m-1)(n-1),\alpha/2}\,\hat{\sigma}\sqrt{\frac{2}{n}} \tag{15.27}$$

$$\left(\frac{x_{.i}}{m} - \frac{x_{.j}}{m}\right) \pm t_{(m-1)(n-1),\alpha/2}\,\hat{\sigma}\sqrt{\frac{2}{m}} \tag{15.28}$$

where $\hat{\sigma}^2$ is given in (15.26).

Note 15.2. Suppose that the hypothesis $\alpha_1 = 0 = \cdots = \alpha_m$ is not rejected but
suppose that someone tries to test individual differences $\alpha_i - \alpha_j = 0$. Is it
possible that some of the individual differences are significantly high, that is, that a
hypothesis on an individual difference being zero is rejected? It is possible and there is no
inconsistency because, locally, some differences may be significantly high but the overall
contribution may not be significantly high. The same note is applicable to the $\beta_j$'s
also.


Example 15.3. The following is the data collected from a randomized block design
without replication and it is assumed that blocks and treatments do not interact with
each other. The data is collected and classified according to blocks $B_1, B_2, B_3, B_4$ and
treatments $T_1, T_2, T_3$. Do the first stage analysis of block effects and treatment effects
on this data.


        T_1   T_2   T_3
B_1      1     5     8
B_2      6     4     2
B_3      2     4     4
B_4      5     4     5


Solution 15.3. The number of rows m = 4 and the number of columns n = 3. The
marginal sums are the following: $x_{1.} = 14$, $x_{2.} = 12$, $x_{3.} = 10$, $x_{4.} = 14$;
$x_{.1} = 14$, $x_{.2} = 17$, $x_{.3} = 19$; $x_{..} = 14 + 17 + 19 = 50$. Then the correction
factor

$$\text{C.F} = \frac{x_{..}^2}{mn} = \frac{(50)^2}{12} = 208.33.$$

Then the sum of squares due to rows

$$\sum_i \frac{x_{i.}^2}{n} - \text{C.F} = \frac{1}{3}[(14)^2 + (12)^2 + (10)^2 + (14)^2] - \text{C.F} = 3.67.$$

Sum of squares due to treatments

$$\sum_j \frac{x_{.j}^2}{m} - \text{C.F} = \frac{1}{4}[(14)^2 + (17)^2 + (19)^2] - \text{C.F} = 3.17.$$

Total sum of squares

$$\sum_{ij} x_{ij}^2 - \text{C.F} = (1)^2 + (5)^2 + (8)^2 + \cdots + (5)^2 - \text{C.F} = 39.67.$$

Residual sum of squares

$$s^2 = 39.67 - 3.17 - 3.67 = 32.83.$$


Then the analysis of variance table can be set up as follows:

Variation due to   d.f    S.S      M.S = (3)/(2)   F-ratio
(1)                (2)    (3)      (4)

Blocks             3      3.67     1.22            $0.22 = F_{3,6}$
Treatments         2      3.17     1.59            $0.29 = F_{2,6}$
Residual           6      32.83    5.47
Total              11     39.67


Let us test at a 5% level of rejection. The tabulated values are the following:
$F_{2,6,0.05} = 5.14$, $F_{3,6,0.05} = 4.76$.


Hence the hypothesis that the block effects are the same is not rejected. The hypothesis
that the treatment effects are the same is also not rejected. Hence the contributions
due to the $\alpha_i$'s as well as due to the $\beta_j$'s are not significant. Hence no further
analysis of the
differences between individual effects will be done here.
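
The computations of Solution 15.3 follow the same correction-factor pattern and can be sketched in a few lines. The function name `randomized_block_anova` and the dictionary keys are illustrative assumptions; the data are those of Example 15.3, with blocks as rows and treatments as columns.

```python
# Two-way classification without interaction, one observation per cell.
# Function name and dictionary keys are illustrative assumptions.

def randomized_block_anova(table):
    """table[i][j] is the observation for block i and treatment j."""
    m, n = len(table), len(table[0])
    block_totals = [sum(row) for row in table]                        # x_i.
    treat_totals = [sum(row[j] for row in table) for j in range(n)]   # x_.j
    cf = sum(block_totals) ** 2 / (m * n)                             # C.F
    ss_blocks = sum(t ** 2 for t in block_totals) / n - cf
    ss_treat = sum(t ** 2 for t in treat_totals) / m - cf
    ss_total = sum(x ** 2 for row in table for x in row) - cf
    ss_resid = ss_total - ss_blocks - ss_treat                        # by subtraction
    ms_resid = ss_resid / ((m - 1) * (n - 1))
    return {
        "ss_blocks": ss_blocks, "ss_treat": ss_treat,
        "ss_resid": ss_resid, "ss_total": ss_total,
        "f_blocks": (ss_blocks / (m - 1)) / ms_resid,
        "f_treat": (ss_treat / (n - 1)) / ms_resid,
    }

data = [
    [1, 5, 8],   # block B_1, treatments T_1, T_2, T_3
    [6, 4, 2],   # block B_2
    [2, 4, 4],   # block B_3
    [5, 4, 5],   # block B_4
]
result = randomized_block_anova(data)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Both F-ratios come out well below the tabulated 5% points quoted in the solution, which is why the analysis stops at the first stage.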


15.3.2 Two-way classification model with interaction 


As explained earlier, there may be a joint effect when two sets of treatments are tried in
an experiment, such as varieties of corn (one set of treatments) and fertilizers (second set
of treatments). Certain varieties may interact with certain fertilizers. In such a situation,
a simple randomized block experiment with one observation per cell is not suitable for the
analysis of the data. We need to replicate the design so that we have r observations in each
cell. We may design a randomized block experiment to be replicated r times. Suppose that it
is an experiment involving planting of tapioca. Animals like to eat the tapioca plant. Suppose
that in some of the replicates a few plots are eaten up by animals and the final set of
observations has $n_{ij}$ observations in the (i, j)-th cell, where the $n_{ij}$'s need not be
equal. This is the general situation of a two-way classification model with multiple
observations per cell. We will consider here only the simplest situation of equal numbers of
observations per cell, and the general case will not be discussed here. The students are
advised to read books on design of experiments for information on the general case as well as
on other designs, and also to see the paper [1].

Consider the model of (15.16), where the k-th observation on the (i, j)-th combination
of the two types of treatments is $x_{ijk}$, i = 1, ..., m, j = 1, ..., n, k = 1, ..., r. Then a
linear, fixed effect, additive model with interaction is of the following type:

$$x_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + e_{ijk}, \quad i = 1, \ldots, m,\ j = 1, \ldots, n,\ k = 1, \ldots, r,\ e_{ijk} \sim N(0, \sigma^2) \tag{15.29}$$

$$= \mu_{ij} + e_{ijk}, \quad \mu_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij} \tag{15.30}$$

where μ is a general effect, $\alpha_i$ is the deviation from the general effect due to the i-th
treatment of the first set, $\beta_j$ is the deviation from the general effect due to the j-th
treatment of the second set, $\gamma_{ij}$ is the interaction effect and $e_{ijk}$ is the random
part. Here again, without loss of generality, we may assume $\alpha_. = 0$, $\beta_. = 0$,
$\gamma_{..} = 0$. From the model (15.30), one can easily compute the residual sum of squares.
Consider the error sum of squares as

$$\sum_{ijk} e_{ijk}^2 = \sum_{ijk}(x_{ijk} - \mu_{ij})^2. \tag{15.31}$$

By differentiating (15.31) with respect to $\mu_{ij}$ and equating to zero, one has

$$\hat{\mu}_{ij} = \sum_k \frac{x_{ijk}}{r} = \frac{x_{ij.}}{r}. \tag{15.32}$$

Hence the least square minimum, $s^2$, is given by

$$s^2 = \sum_{ijk}\left(x_{ijk} - \frac{x_{ij.}}{r}\right)^2 = \sum_{ijk}\left(x_{ijk} - \frac{x_{...}}{mnr}\right)^2 - \sum_{ijk}\left(\frac{x_{ij.}}{r} - \frac{x_{...}}{mnr}\right)^2$$

$$= \left(\sum_{ijk} x_{ijk}^2 - \frac{x_{...}^2}{mnr}\right) - \left(\sum_{ij}\frac{x_{ij.}^2}{r} - \frac{x_{...}^2}{mnr}\right). \tag{15.33}$$


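The first hypothesis to be tested is $H_0: \gamma_{ij} = 0$, that is, there is no interaction.
Then the model is

$$x_{ijk} = \mu + \alpha_i + \beta_j + e_{ijk}$$

and the least square minimum, denoted by $s_0^2$, is given by

$$s_0^2 = \sum_{ijk}\left(x_{ijk} - \frac{x_{...}}{mnr}\right)^2 - \sum_{ijk}\left(\frac{x_{i..}}{nr} - \frac{x_{...}}{mnr}\right)^2 - \sum_{ijk}\left(\frac{x_{.j.}}{mr} - \frac{x_{...}}{mnr}\right)^2. \tag{15.34}$$

Therefore, the sum of squares due to interaction is given by

$$s_0^2 - s^2 = \left[\sum_{ijk}\left(x_{ijk} - \frac{x_{...}}{mnr}\right)^2 - \sum_{ijk}\left(\frac{x_{i..}}{nr} - \frac{x_{...}}{mnr}\right)^2 - \sum_{ijk}\left(\frac{x_{.j.}}{mr} - \frac{x_{...}}{mnr}\right)^2\right] - \left[\sum_{ijk}\left(x_{ijk} - \frac{x_{...}}{mnr}\right)^2 - \sum_{ijk}\left(\frac{x_{ij.}}{r} - \frac{x_{...}}{mnr}\right)^2\right]$$

$$= \sum_{ijk}\left(\frac{x_{ij.}}{r} - \frac{x_{...}}{mnr}\right)^2 - \sum_{ijk}\left(\frac{x_{i..}}{nr} - \frac{x_{...}}{mnr}\right)^2 - \sum_{ijk}\left(\frac{x_{.j.}}{mr} - \frac{x_{...}}{mnr}\right)^2$$

$$= \left[\sum_{ij}\frac{x_{ij.}^2}{r} - \text{C.F}\right] - \left[\sum_i\frac{x_{i..}^2}{nr} - \text{C.F}\right] - \left[\sum_j\frac{x_{.j.}^2}{mr} - \text{C.F}\right] \tag{15.35}$$

where $\text{C.F} = x_{...}^2/(mnr)$. When the interaction $\gamma_{ij} = 0$, then there is
meaning in estimating the $\alpha_i$'s and $\beta_j$'s separately. When the interaction effect
$\gamma_{ij}$ is present, then part of the effect due to the i-th treatment of the first set is
mixed up with $\gamma_{ij}$. Similarly, part of the effect of the j-th treatment of the second
set is also mixed up with $\gamma_{ij}$, and hence one should not try to estimate $\alpha_i$ and
$\beta_j$ separately when $\gamma_{ij}$ is present. Hence, testing of hypotheses on the
$\alpha_i$'s should be done only if the hypothesis $\gamma_{ij} = 0$ is not rejected; the same
holds for testing hypotheses on the $\beta_j$'s.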

Now we can check the degrees of freedom for the various chi-squares when
$e_{ijk} \sim N(0, \sigma^2)$ for all i, j, k, and mutually independently distributed. The sum
of squares due to interaction is denoted by S.S(int) and it is given by

$$\text{S.S(int)} = \left[\sum_{ij}\frac{x_{ij.}^2}{r} - \text{C.F}\right] - \left[\sum_i\frac{x_{i..}^2}{nr} - \text{C.F}\right] - \left[\sum_j\frac{x_{.j.}^2}{mr} - \text{C.F}\right] \tag{15.36}$$

with degrees of freedom [mn - 1] - [m - 1] - [n - 1] = (m - 1)(n - 1). The residual sum of
squares is $[\sum_{ijk} x_{ijk}^2 - \text{C.F}] - [\sum_{ij} x_{ij.}^2/r - \text{C.F}]$ with
degrees of freedom [mnr - 1] - [mn - 1] = mn(r - 1). Once $\gamma_{ij} = 0$, then the sum of
squares due to $\alpha_i$ is $[\sum_i x_{i..}^2/(nr) - \text{C.F}]$ with degrees of freedom
[m - 1]. Similarly, once $\gamma_{ij} = 0$ then the sum of squares due to $\beta_j$ is
$[\sum_j x_{.j.}^2/(mr) - \text{C.F}]$ with degrees of freedom [n - 1]. Now, we can set up the
analysis of variance table.

In the following table, Set A means the first set of treatments and Set B means the
second set of treatments, C.F = correction factor = $x_{...}^2/(mnr)$, ρ = mn(r - 1) is the
degrees of freedom for the residual sum of squares, ν = (m - 1)(n - 1) is the degrees of
freedom for the interaction sum of squares, and the interaction sum of squares, which is given
above in (15.36), is denoted by S.S(int).


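ANOVA for two-way classification with r observations per cell

Variation due to   d.f               S.S                                     M.S = (3)/(2)   F-ratio
(1)                (2)               (3)                                     (4)

Set A              m - 1             $\sum_i x_{i..}^2/(nr) - \text{C.F}$    A               $A/D = F_{m-1,\rho}$
Set B              n - 1             $\sum_j x_{.j.}^2/(mr) - \text{C.F}$    B               $B/D = F_{n-1,\rho}$
Interaction        (m - 1)(n - 1)    S.S(int)                                C               $C/D = F_{\nu,\rho}$
Between cells      mn - 1            $\sum_{ij} x_{ij.}^2/r - \text{C.F}$
Residual           ρ                 (by subtraction)                        D
Total              mnr - 1           $\sum_{ijk} x_{ijk}^2 - \text{C.F}$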
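
The rows of this table can be computed mechanically from the cell, row and column totals. The sketch below does so for a small hypothetical data set with m = 2, n = 2 and r = 2; the numbers are invented purely for illustration and are not taken from the text, and the function name and dictionary keys are likewise assumptions.

```python
# Two-way classification with interaction and r observations per cell.
# Function name, keys and the data set are hypothetical illustrations.

def two_way_with_replication(data):
    """data[i][j] is the list of r observations in cell (i, j)."""
    m, n, r = len(data), len(data[0]), len(data[0][0])
    cell = [[sum(data[i][j]) for j in range(n)] for i in range(m)]    # x_ij.
    row = [sum(cell[i]) for i in range(m)]                            # x_i..
    col = [sum(cell[i][j] for i in range(m)) for j in range(n)]       # x_.j.
    cf = sum(row) ** 2 / (m * n * r)                                  # C.F
    ss_total = sum(x ** 2 for i in range(m) for j in range(n) for x in data[i][j]) - cf
    ss_cells = sum(c ** 2 for cells in cell for c in cells) / r - cf  # between cells
    ss_a = sum(t ** 2 for t in row) / (n * r) - cf                    # Set A
    ss_b = sum(t ** 2 for t in col) / (m * r) - cf                    # Set B
    ss_int = ss_cells - ss_a - ss_b                                   # S.S(int), as in (15.36)
    ss_res = ss_total - ss_cells                                      # by subtraction
    ms_res = ss_res / (m * n * (r - 1))
    return {
        "ss_a": ss_a, "ss_b": ss_b, "ss_int": ss_int,
        "ss_cells": ss_cells, "ss_res": ss_res, "ss_total": ss_total,
        "f_int": (ss_int / ((m - 1) * (n - 1))) / ms_res,
        "f_a": (ss_a / (m - 1)) / ms_res,
        "f_b": (ss_b / (n - 1)) / ms_res,
    }

# Hypothetical observations: 2 levels of Set A, 2 levels of Set B, 2 replicates.
hypothetical = [
    [[1, 3], [2, 4]],
    [[5, 7], [10, 12]],
]
anova = two_way_with_replication(hypothetical)
for name, value in anova.items():
    print(f"{name}: {value:.2f}")
```

The interaction F-ratio is the one to examine first; the ratios for Set A and Set B are returned for convenience but are meant to be interpreted only when the interaction is not significant.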




Note 15.3. There is a school of thought that when the interaction effect is found
to be insignificant, that is, when the hypothesis $H_0: \gamma_{ij} = 0$ is not rejected, one
should add the sum of squares due to interaction to the residual sum of squares, with the
corresponding degrees of freedom added up, and then test the main effects $\alpha_i$'s
and $\beta_j$'s against this new residual sum of squares. We will not adopt that procedure
because, even though the interaction sum of squares is not significantly high, it does
not mean that there is no contribution from interaction. Hence we will not add
the interaction sum of squares to the residual sum of squares. We will treat them
separately, even if the interaction effect is not significant. If the interaction is
insignificant, then we will proceed to test hypotheses on the main effects $\alpha_i$'s and
$\beta_j$'s; otherwise we will not test hypotheses on the $\alpha_i$'s and $\beta_j$'s.


Individual hypotheses on the main effects $\alpha_i$'s and $\beta_j$'s can be tested only if
the interaction effect is found to be insignificant. In this case,

$$\frac{\hat{\alpha}_s - \hat{\alpha}_t}{\hat{\sigma}\sqrt{\frac{2}{nr}}} = \frac{\frac{x_{s..}}{nr} - \frac{x_{t..}}{nr}}{\hat{\sigma}\sqrt{\frac{2}{nr}}} \sim t_{mn(r-1)} \tag{15.37}$$

under the hypothesis $\alpha_s - \alpha_t = 0$, where

$$\hat{\sigma}^2 = \frac{\text{Least square minimum}}{mn(r-1)} = \frac{s^2}{mn(r-1)}.$$

A similar result can be used for testing the hypothesis $\beta_s - \beta_t = 0$. The dividing
factor in this case is mr instead of nr. The confidence interval can also be set up by using
the result (15.37) and the corresponding result for $\hat{\beta}_s - \hat{\beta}_t$.


Example 15.4. A randomized block design is done on 4 blocks and 3 treatments and
then it is replicated 3 times. The final data are collected and classified into the
following format. Do an analysis of the following data:


       T_1  T_2  T_3          T_1  T_2  T_3          T_1  T_2  T_3          T_1  T_2  T_3
B_1     1    4    2    B_2     5    6    4    B_3     8    6    4    B_4     1    2    2
        2    2    3            6    8    9            5    5    6            2    1    1
        3    4    4            4    5    5            6    5    4            1    3    2

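Solution 15.4. Here, i = 1, 2, 3, 4 or m = 4, j = 1, 2, 3 or n = 3, k = 1, 2, 3 or r = 3.

$x_{11.} = 6$, $x_{12.} = 10$, $x_{13.} = 8$, $x_{21.} = 15$, $x_{22.} = 19$, $x_{23.} = 18$,

$x_{31.} = 19$, $x_{32.} = 16$, $x_{33.} = 14$, $x_{41.} = 4$, $x_{42.} = 6$, $x_{43.} = 5$.

Total = $\sum_{ijk} x_{ijk} = 140$. Hence

$$\text{C.F} = \frac{x_{...}^2}{mnr} = \frac{(140)^2}{36} = 544.44.$$

$$\text{Block S.S} = \sum_i \frac{x_{i..}^2}{nr} - \text{C.F} = \frac{1}{9}[(24)^2 + (52)^2 + (49)^2 + (15)^2] - \text{C.F} = 111.78$$

$$\text{Treatment S.S} = \sum_j \frac{x_{.j.}^2}{mr} - \text{C.F} = \frac{1}{12}[(44)^2 + (51)^2 + (45)^2] - \text{C.F} = 2.39$$

$$\sum_{ij} \frac{x_{ij.}^2}{r} - \text{C.F} = \frac{1}{3}[(6)^2 + (10)^2 + \cdots + (5)^2] - \text{C.F} = 122.23$$

$$\text{Residual S.S} = 166.56 - 122.23 = 44.33$$

$$\text{Total S.S} = \sum_{ijk} x_{ijk}^2 - \text{C.F} = 166.56$$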


Then the analysis of variance table is the following:

Variation due to   d.f    S.S       M.S = (3)/(2)   F-ratio
(1)                (2)    (3)       (4)

Blocks             3      111.78    37.26           $20.14 = F_{3,24}$
Treatments         2      2.5       1.25            $0.68 = F_{2,24}$
Interaction        6      7.95      1.34            $4.30 = F_{6,24}$
Between cells      11     122.23
Residual           24     44.33     1.85
Total              35     166.56


Let us test at the α = 0.05 level of rejection. Then $F_{6,24,0.05} = 2.51 < 4.30$ from tables,
and hence the hypothesis of no interaction is rejected. Since this hypothesis is rejected,
there is a possibility of interaction, and hence we cannot test any hypothesis on the
main effects, that is, on the $\alpha_i$'s and $\beta_j$'s, and hence we stop the analysis here.


Exercises 15.3 


15.3.1. When $e_{ij} \sim N(0, \sigma^2)$ and independently distributed in a two-way
classification linear fixed effect model without interaction, with i = 1, ..., m,
j = 1, ..., n, prove the following:

(i) The block sum of squares

$$\sum_{ij}\left(\frac{x_{i.}}{n} - \frac{x_{..}}{mn}\right)^2 = \sum_i \frac{x_{i.}^2}{n} - \frac{x_{..}^2}{mn} = A \sim \sigma^2\chi^2_{m-1}$$

under $H_0: \alpha_1 = 0 = \cdots = \alpha_m$.

(ii) The treatment sum of squares

$$\sum_{ij}\left(\frac{x_{.j}}{m} - \frac{x_{..}}{mn}\right)^2 = \sum_j \frac{x_{.j}^2}{m} - \frac{x_{..}^2}{mn} = B \sim \sigma^2\chi^2_{n-1}$$

under $H_0: \beta_1 = 0 = \cdots = \beta_n$.

(iii) The residual sum of squares

$$C = \sum_{ij}\left(x_{ij} - \frac{x_{..}}{mn}\right)^2 - A - B = \sum_{ij} x_{ij}^2 - \frac{x_{..}^2}{mn} - A - B \sim \sigma^2\chi^2_{(m-1)(n-1)}.$$

(iv) A and C are independently distributed, and

(v) B and C are independently distributed.

Hint: Write in terms of the $e_{ij}$'s and then look at the matrices of the corresponding
quadratic forms.


15.3.2. Consider the two-way classification fixed effect model with interaction as in
equation (15.16). Let $e_{ijk} \sim N(0, \sigma^2)$, i = 1, ..., m, j = 1, ..., n,
k = 1, ..., r, and independently distributed. By expressing various quantities in terms of the
$e_{ijk}$'s and then studying the properties of the corresponding quadratic forms, establish the
following results:

(i) Residual sum of squares

$$\sum_{ijk}\left(x_{ijk} - \frac{x_{ij.}}{r}\right)^2 = \left[\sum_{ijk}\left(x_{ijk} - \frac{x_{...}}{mnr}\right)^2\right] - \left[\sum_{ijk}\left(\frac{x_{ij.}}{r} - \frac{x_{...}}{mnr}\right)^2\right] = D \sim \sigma^2\chi^2_{mn(r-1)}$$

(ii) Total sum of squares

$$\sum_{ijk}\left(x_{ijk} - \frac{x_{...}}{mnr}\right)^2 \sim \sigma^2\chi^2_{mnr-1}$$

(iii) Interaction sum of squares

$$\sum_{ijk}\left(\frac{x_{ij.}}{r} - \frac{x_{...}}{mnr}\right)^2 - \sum_{ijk}\left(\frac{x_{i..}}{nr} - \frac{x_{...}}{mnr}\right)^2 - \sum_{ijk}\left(\frac{x_{.j.}}{mr} - \frac{x_{...}}{mnr}\right)^2 \sim \sigma^2\chi^2_{(m-1)(n-1)}$$

when $\gamma_{ij} = 0$ for all i and j.

(iv) When $\gamma_{ij} = 0$ for all i and j, then A and D as well as B and D are independently
distributed, where


15.3.3. Do a complete analysis of the following data on a randomized block experiment
where $B_1, B_2, B_3$ denote the blocks and $T_1, T_2, T_3, T_4$ denote the treatments. The
experiment is to study the yield $x_{ij}$ of beans where the 3 blocks are the 3 locations and
the 4 treatments are the 4 varieties. If the block effect is found to be significant then
check for individual differences in the $\alpha_i$'s. Similarly, if the treatment effect is
found to be significant then check for individual differences, to complete the analysis.




15.3.4. Do a complete analysis of the following data on a two-way classification with
possibility of interaction. The first set of treatments is denoted by $A_1, A_2, A_3$ and
the second set of treatments is denoted by $B_1, B_2, B_3, B_4$, and there are 3 replications
(3 data points in each cell). If interaction is found to be insignificant, then test for the
main effects. If the main effects are found to be contributing significantly, then check
for individual differences.


B, B, B, B, B, B, B, B, B, B, B, B, 
A 2 AR dO 1 A4 5 6 55 ολ 1 0 2 3 
11 10 10 1 6 6 5 5 2 1 0 1 
10 12 12- 12 5 6 6 6 3 1 1 2 


15.4 Latin square designs 


Definition 15.2 (A Latin Square). A Latin Square is a square arrangement of m 
Latin letters into m rows and m columns so that each letter appears in each row 
and each column once and only once. 


Consider the following arrangements of 3 Latin letters, 3 Greek letters and 3 numbers
into 3 rows and 3 columns:

$$M_1 = \begin{matrix} A & B & C\\ B & C & A\\ C & A & B \end{matrix}, \qquad M_2 = \begin{matrix} \alpha & \beta & \gamma\\ \gamma & \alpha & \beta\\ \beta & \gamma & \alpha \end{matrix}, \tag{15.38}$$

$$M_3 = \begin{matrix} 1 & 2 & 3\\ 2 & 3 & 1\\ 3 & 1 & 2 \end{matrix}, \qquad M_{12} = \begin{matrix} A\alpha & B\beta & C\gamma\\ B\gamma & C\alpha & A\beta\\ C\beta & A\gamma & B\alpha \end{matrix}. \tag{15.39}$$

Note that $M_1$, $M_2$, $M_3$ are all Latin squares; $M_2$ has Greek letters and $M_3$ has numbers
in the cells, but all satisfy the conditions in Definition 15.2. Note that $M_{12}$ is in the
form of $M_2$ superimposed on $M_1$. In this superimposed structure, every Greek letter comes
with every Latin letter once and only once. If there are two Latin squares which, when
superimposed, have the property that every letter in one square comes with every letter
in the other square once and only once, then such squares are called orthogonal
Latin squares or Greek-Latin squares. There are some results on the maximum number
of such orthogonal squares possible for a given m. The maximum possible is evidently
m - 1, but these m - 1 orthogonal squares may not exist for every given m. Construction
of all orthogonal squares for a given m is an open problem. Once in a while people
come up with all squares for a new number m.
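
Both defining properties are easy to check by machine. The following sketch tests the Latin square condition of Definition 15.2 and the orthogonality condition for the squares in (15.38) and (15.39); the function names are illustrative, and the letters a, b, g are used as plain-text stand-ins for α, β, γ.

```python
# Checking the Latin square and orthogonality properties for small squares.
# Function names are illustrative; a, b, g stand in for alpha, beta, gamma.

def is_latin_square(square):
    """True if every symbol appears exactly once in each row and each column."""
    m = len(square)
    symbols = set(square[0])
    if len(symbols) != m:
        return False
    rows_ok = all(len(row) == m and set(row) == symbols for row in square)
    cols_ok = all({square[i][j] for i in range(m)} == symbols for j in range(m))
    return rows_ok and cols_ok

def are_orthogonal(first, second):
    """True if superimposing the squares gives every ordered pair exactly once."""
    m = len(first)
    pairs = {(first[i][j], second[i][j]) for i in range(m) for j in range(m)}
    return len(pairs) == m * m

M1 = ["ABC", "BCA", "CAB"]
M2 = ["abg", "gab", "bga"]
M3 = ["123", "231", "312"]

print(all(is_latin_square(s) for s in (M1, M2, M3)))
print(are_orthogonal(M1, M2))
print(are_orthogonal(M1, M3))
```

Here $M_3$ is $M_1$ with the letters renamed as numbers, so superimposing the two produces only three distinct pairs and the pair $M_1$, $M_3$ is not orthogonal, whereas $M_1$ and $M_2$ are.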

Here, we are concerned with using a Latin square for constructing designs called Latin square designs. In a Latin square design, we assign one set of treatments to the rows, one set of treatments to the columns and one set of treatments to the Latin letters. If orthogonal squares are available, then additional sets of treatments corresponding to the letters in each orthogonal square can be tried. The total number of cells is only $m^2$; that is, by using $m^2$ experimental plots one will be able to test hypotheses on different sets of $m$ treatments each. This is the main advantage of a Latin square design. The trade-off is that all the treatment sets must have an equal number of treatments, namely $m$ treatments in each set. Another drawback is that in the analysis there is no provision for interaction. Hence do not conduct an experiment with the help of a Latin square design if the different sets of treatments are likely to interact with each other, that is, if effects due to combinations of treatments are likely to be present. Let us start with one Latin square of side $m$, something like $M_1$ in (15.38). A model that we can take is the following:


\[
x_{ij(k)} = \mu + \alpha_i + \beta_j + \gamma_k + e_{ij(k)} \tag{15.40}
\]


where $x_{ij(k)}$ is the observation in the $(i,j)$-th cell if the $k$-th letter is present in the $(i,j)$-th cell. For example, in the illustrative design $M_1$ the letter $A$, the first letter, appears in the first row, first column cell. Hence $x_{11(1)}$ is there whereas $x_{11(2)}$ and $x_{11(3)}$ are not there, since the letters $B$ and $C$ do not appear in the $(1,1)$-th cell. This is the meaning of the subscript $k$ put in brackets. Since every letter is present in every row and every column, when we sum up with respect to the row subscript $i$, $k$ is automatically summed up. Similarly, when we sum up with respect to the column subscript $j$, the subscript $k$ is also automatically summed up. Let us use the following notations.


\[
R_i = i\text{-th row sum}; \quad C_j = j\text{-th column sum}; \quad T_k = k\text{-th letter sum}.
\]


For calculating $R_i$, sum up all observations in the $i$-th row. Similarly, sum up all observations in the $j$-th column to obtain $C_j$. But $T_k$ is obtained by searching for the $k$-th treatment in each row and then summing up the corresponding observations. By computing the least square estimates of the effects and then substituting in the error sum of squares, we obtain the least square minimum $s^2$. Then put the hypothesis that the first set of treatment effects are zeros, or $H_0: \alpha_1 = 0 = \cdots = \alpha_m$. Compute the least square minimum under this hypothesis, $s_0^2$. Then take $s_0^2 - s^2$ to obtain the sum of squares due to the rows, or due to the $\alpha_i$'s. Similarly, by putting $H_0: \beta_1 = 0 = \cdots = \beta_m$, compute the least square minimum under this hypothesis. Call it $s_{00}^2$. Then $s_{00}^2 - s^2$ is the sum of squares due to the second set of treatments, or columns. By putting $H_0: \gamma_1 = 0 = \cdots = \gamma_m$ and taking the difference between the least square minima, one under the hypothesis and one without any hypothesis, we obtain the sum of squares due to the $\gamma_k$'s, or the third set of treatments. The sums of squares can be simplified to the following, where the degrees of freedom (d.f) are those associated with the corresponding chi-squares when it is assumed that the $e_{ij(k)}$'s are independently distributed as $e_{ij(k)} \sim N(0, \sigma^2)$. Here, C.F $= x_{..}^2/m^2$ = correction factor and S.S = sum of squares.


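\[
\begin{aligned}
\sum_i \frac{R_i^2}{m} - \text{C.F} &= \text{S.S due to rows, with d.f} = m-1\\
\sum_j \frac{C_j^2}{m} - \text{C.F} &= \text{S.S due to columns, with d.f} = m-1\\
\sum_k \frac{T_k^2}{m} - \text{C.F} &= \text{S.S due to letters, with d.f} = m-1.
\end{aligned} \tag{15.41}
\]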


The total sum of squares is $\sum_{ij} x_{ij(k)}^2 - \text{C.F}$ with d.f $= m^2 - 1$, and the residual sum of squares $s^2$ equals the total sum of squares minus the sums of squares due to rows, columns and letters, that is, the three sets of treatments, with degrees of freedom $\nu = (m^2 - 1) - 3(m-1) = (m-1)(m-2)$ for $m \ge 3$. By using the above observations, one can set up the analysis of variance or ANOVA table for a simple Latin square design, where the three sets of treatments, one corresponding to the rows, one corresponding to the columns and one corresponding to the letters, are such that there is no possibility of interaction between any two of them.

One more set of treatments can be tried if we have a pair of orthogonal designs, that is, if we have a Greek-Latin square such as $M_{12}$ of (15.39). In this case, the model will be of the form:


\[
x_{ij(kt)} = \mu + \alpha_i + \beta_j + \gamma_k + \delta_t + e_{ij(kt)} \tag{15.42}
\]


where $i = 1, \ldots, m$, $j = 1, \ldots, m$, $k = 1, \ldots, m$, $t = 1, \ldots, m$. When we sum up with respect to $i$ or $j$, automatically $k$ and $t$ are summed up. In this case, corresponding to (15.41), there will be one more sum of squares due to the fourth set of treatments, denoted by $\sum_t U_t^2/m - \text{C.F}$, with degrees of freedom $m-1$ again. The correction factor remains the same as above. In this case, the degrees of freedom for the residual sum of squares is $\nu_1 = m^2 - 1 - 4(m-1) = (m-1)(m-3)$ for $m \ge 4$. The analysis of variance tables for the simple Latin square design and the Greek-Latin square design are the following:




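ANOVA table for a simple Latin square design


Variation due to   d.f        S.S                            M.S             F-ratio
(1)                (2)        (3)                            (4) = (3)/(2)
Rows               $m-1$      $\sum_i R_i^2/m - \text{C.F}$  $A$             $A/D = F_{m-1,\nu}$
Columns            $m-1$      $\sum_j C_j^2/m - \text{C.F}$  $B$             $B/D = F_{m-1,\nu}$
Letters            $m-1$      $\sum_k T_k^2/m - \text{C.F}$  $C$             $C/D = F_{m-1,\nu}$
Residue            $\nu$      $s^2$                          $D$
Total              $m^2-1$    $\sum_{ij} x_{ij(k)}^2 - \text{C.F}$


where $\nu = (m^2 - 1) - 3(m-1) = (m-1)(m-2)$.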


ANOVA table for a Greek-Latin square design


Variation due to   d.f        S.S                            M.S             F-ratio
(1)                (2)        (3)                            (4) = (3)/(2)
Rows               $m-1$      $\sum_i R_i^2/m - \text{C.F}$  $A$             $A/E = F_{m-1,\nu_1}$
Columns            $m-1$      $\sum_j C_j^2/m - \text{C.F}$  $B$             $B/E = F_{m-1,\nu_1}$
Latin letters      $m-1$      $\sum_k T_k^2/m - \text{C.F}$  $C$             $C/E = F_{m-1,\nu_1}$
Greek letters      $m-1$      $\sum_t U_t^2/m - \text{C.F}$  $D$             $D/E = F_{m-1,\nu_1}$
Residue            $\nu_1$    $s^2$                          $E$
Total              $m^2-1$    $\sum_{ij} x_{ij(kt)}^2 - \text{C.F}$


where $\nu_1 = (m-1)(m-3)$, $m \ge 4$. If we have more orthogonal designs, say a set of $n$ orthogonal designs, then by using this set we can try $(n+2)$ sets of treatments on $m^2$ test plots, for $m \ge n+2$. The procedure is exactly the same. The residual degrees of freedom in this case will be $\nu_2 = (m^2 - 1) - (n+2)(m-1) = (m-1)(m-n-1)$. One illustrative example on a simple Latin square design will be given here.


Example 15.5. The following is the design and the data on a simple Latin square design. Do the analysis of the data.


Design          Data            Sum
A  B  C         1   5   4       10
B  C  A         2   6   7       15
C  A  B         5   2   4       11

Sum             8   13  15      36




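Solution 15.5. According to our notation, the row sums, denoted by $R_1, R_2, R_3$, the column sums, denoted by $C_1, C_2, C_3$, and the sums corresponding to the letters, denoted by $T_1, T_2, T_3$, are the following:


\[
\begin{aligned}
R_1 &= 10, & R_2 &= 15, & R_3 &= 11\\
C_1 &= 8, & C_2 &= 13, & C_3 &= 15\\
T_1 &= 10, & T_2 &= 11, & T_3 &= 15.
\end{aligned}
\]


Note that $x_{..} = 36$, and hence


\[
\text{C.F} = \frac{x_{..}^2}{m^2} = \frac{(36)^2}{9} = 144.
\]


Also


\[
\begin{aligned}
\text{S.S due to rows} &= \sum_i \frac{R_i^2}{m} - \text{C.F} = \frac{446}{3} - 144 \approx 4.67\\
\text{S.S due to columns} &= \sum_j \frac{C_j^2}{m} - \text{C.F} = \frac{458}{3} - 144 \approx 8.67\\
\text{S.S due to letters} &= \sum_k \frac{T_k^2}{m} - \text{C.F} = \frac{446}{3} - 144 \approx 4.67.
\end{aligned}
\]


The total sum of squares is $\sum_{ij} x_{ij(k)}^2 - \text{C.F} = 176 - 144 = 32$, so the residual sum of squares is $32 - 4.67 - 8.67 - 4.67 \approx 14.00$ with 2 degrees of freedom, giving a residual mean square of 7.00. Each F-ratio is the corresponding mean square divided by the residual mean square. Hence the ANOVA table is the following:


Variation due to   d.f   S.S     M.S    F-ratio

Rows               2     4.67    2.33   0.33
Columns            2     8.67    4.33   0.62
Letters            2     4.67    2.33   0.33
Residue            2     14.00   7.00
Total              8     32.00


Let us test at $\alpha = 0.05$. Then the tabulated value of $F_{2,2,0.05} = 19$. Hence the hypothesis $H_0: \alpha_1 = \alpha_2 = \alpha_3 = 0$ is not rejected, or the row effect is not significant. Similarly, it can be seen that the column effect as well as the letter effect are not significant here.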
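
The computations in (15.41) are easy to automate. The following is a minimal sketch, assuming the design is given as rows of letters and the data as a matrix of the same shape; the function name `latin_square_anova` and the dictionary layout are illustrative choices, not notation from the text. Each F-ratio is formed as the treatment mean square divided by the residual mean square.

```python
def latin_square_anova(design, data):
    """ANOVA quantities for a simple m x m Latin square design, following (15.41)."""
    m = len(data)
    grand_total = sum(sum(row) for row in data)
    cf = grand_total ** 2 / m ** 2                      # correction factor

    row_sums = [sum(row) for row in data]               # R_i
    col_sums = [sum(data[i][j] for i in range(m)) for j in range(m)]  # C_j
    letter_sums = {}                                    # T_k
    for i in range(m):
        for j in range(m):
            letter = design[i][j]
            letter_sums[letter] = letter_sums.get(letter, 0) + data[i][j]

    ss = {
        "rows": sum(r * r for r in row_sums) / m - cf,
        "columns": sum(c * c for c in col_sums) / m - cf,
        "letters": sum(t * t for t in letter_sums.values()) / m - cf,
    }
    ss_total = sum(x * x for row in data for x in row) - cf
    ss_residual = ss_total - sum(ss.values())

    df_effect = m - 1
    df_residual = (m - 1) * (m - 2)
    ms_residual = ss_residual / df_residual

    table = {}
    for name, value in ss.items():
        ms = value / df_effect
        table[name] = {"df": df_effect, "SS": value, "MS": ms, "F": ms / ms_residual}
    table["residual"] = {"df": df_residual, "SS": ss_residual, "MS": ms_residual}
    table["total"] = {"df": m * m - 1, "SS": ss_total}
    return table


design = ["ABC", "BCA", "CAB"]
data = [[1, 5, 4], [2, 6, 7], [5, 2, 4]]
anova = latin_square_anova(design, data)
for source, entry in anova.items():
    print(source, {key: round(value, 2) for key, value in entry.items()})
```

For the data of Example 15.5 this gives sums of squares of about 4.67, 8.67 and 4.67 for rows, columns and letters, a residual sum of squares of 14 and a total of 32.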


Exercises 15.4 


15.4.1. For a Latin square design of side $m$, and with the normality assumption for the errors [$e_{ij(k)} \sim N(0, \sigma^2)$ and mutually independently distributed], show that the residual sum of squares $s^2 \sim \sigma^2 \chi^2_{(m-1)(m-2)}$ and that


\[
\begin{aligned}
\sum_i \frac{R_i^2}{m} - \text{C.F} &\sim \sigma^2 \chi^2_{m-1} \quad \text{under } H_0: \alpha_1 = 0 = \cdots = \alpha_m\\
\sum_j \frac{C_j^2}{m} - \text{C.F} &\sim \sigma^2 \chi^2_{m-1} \quad \text{under } H_0: \beta_1 = 0 = \cdots = \beta_m\\
\sum_k \frac{T_k^2}{m} - \text{C.F} &\sim \sigma^2 \chi^2_{m-1} \quad \text{under } H_0: \gamma_1 = 0 = \cdots = \gamma_m
\end{aligned}
\]


where C.F = correction factor $= x_{..}^2/m^2$, $R_i$ = $i$-th row sum, $C_j$ = $j$-th column sum and $T_k$ is the sum of the observations corresponding to the $k$-th letter.


15.4.2. In Exercise 15.4.1, show that $s^2$ and the sum of squares due to rows are independently distributed, and that the same holds for the column sum of squares and for the sum of squares corresponding to letters.


15.4.3. Do a complete analysis of the following data where the design and the corre- 
sponding data are given: 


Design             Data
A  B  C  D         5  8  2  6
B  C  D  A         4  2  1  5
C  D  A  B         3  8  2  4
D  A  B  C         2  5  2  6


15.5 Some other designs 


There are several other designs in practical use, such as incomplete block designs, bal- 
anced incomplete block designs, partially balanced incomplete block designs, Youden 
square designs, factorial designs, etc. 


15.5.1 Factorial designs 


In drug testing experiments, usually the same drug is administered at different doses. If two drugs at 5 different doses each are tried in an experiment, then we call it two factors at 5 levels and write it as a $2^5$ design. If $m$ factors at $n$ levels each are tried in an experiment, then the design is called an $m^n$ factorial design. If $m_1$ factors at $n_1$ levels each, ..., $m_k$ factors at $n_k$ levels each are tried in an experiment, then we call it an $m_1^{n_1} \cdots m_k^{n_k}$ factorial design. The analysis of factorial designs is completely different from what we have done so far, because there can be all sorts of effects such as linear effects, quadratic effects and so on, as well as different categories of interactions. This is an area by itself.




15.5.2 Incomplete block designs 


In a randomized block experiment, each block has $t$ plots and only $t$ treatments are tried; in other words, all treatments appear in every block. Most of the time it may be difficult to find sets of $t$ plots each which are fully homogeneous within each block. In an experiment involving animals, it may be difficult to find a large number of animals that are identical with respect to genotype and other characteristics. In such a situation, we go for incomplete block designs. In each block, there will be $s$ homogeneous plots, $s < t$, and we take $b$ such blocks so that there are $bs$ plots. Then the $t$ treatments are randomly assigned to the plots so that each treatment is repeated $r$ times in $r$ different blocks, or $bs = rt$. Such a design is called an incomplete block design. We may put other restrictions, such as each pair of treatments appearing $\lambda$ times, or the $i$-th pair being repeated $\lambda_i$ times, and so on. There are different types of balancing as well as partial balancing possible, and such classes of designs are called balanced and partially balanced incomplete block designs.
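
The counting conditions above can be checked on a concrete layout. The following is a minimal sketch using a well-known balanced incomplete block design with $t = 7$ treatments in $b = 7$ blocks of $s = 3$ plots; this particular design and the function name `block_design_summary` are illustrative and are not taken from the text.

```python
from collections import Counter
from itertools import combinations


def block_design_summary(blocks):
    """Count how often each treatment and each pair of treatments occurs."""
    treatment_counts = Counter(t for block in blocks for t in block)
    pair_counts = Counter(
        frozenset(pair) for block in blocks for pair in combinations(sorted(block), 2)
    )
    return treatment_counts, pair_counts


# Illustrative design: 7 treatments, 7 blocks, 3 plots per block.
blocks = [
    {1, 2, 3}, {1, 4, 5}, {1, 6, 7},
    {2, 4, 6}, {2, 5, 7}, {3, 4, 7}, {3, 5, 6},
]
treatment_counts, pair_counts = block_design_summary(blocks)

t = len(treatment_counts)           # number of treatments
b = len(blocks)                     # number of blocks
s = len(blocks[0])                  # plots per block
r = set(treatment_counts.values())  # replications of each treatment
lam = set(pair_counts.values())     # occurrences of each pair

print(t, b, s, r, lam)
print(b * s == next(iter(r)) * t)   # the relation bs = rt
```

In this layout every treatment is repeated $r = 3$ times and every pair of treatments appears together exactly once, so $\lambda = 1$ and $bs = rt = 21$.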

Sometimes, in a Latin square design, one or more rows or columns may be fully lost before the experiment is completed. The remaining rows and columns, of course, do not make a Latin square. They will be incomplete Latin squares, called Youden squares. Such designs are called Youden square designs.


15.5.3 Response surface analysis 


In all the analyses of the various problems that we have considered so far, we took the models to be linear additive models. For example, in a one-way classification model we have taken the model as


\[
x_{ij} = \mu + \alpha_i + e_{ij} \tag{15.43}
\]


where $\mu$ is the general effect, $\alpha_i$ is the deviation from the general effect due to the $i$-th treatment and $e_{ij}$ is the random part. Here, $x_{ij}$, the observation, could be called the response to the $i$-th treatment. In general, $x_{ij}$ could be some linear or non-linear function. Let us denote it by $\phi$. Then


\[
x_{ij} = \phi(\mu, \alpha_i, e_{ij}). \tag{15.44}
\]


Analysis of this non-linear function $\phi$ is called response surface analysis. This is an area by itself.


15.5.4 Random effect models


So far, we have been considering only fixed effect models. For example, in a model such as the one in (15.43), we assumed that $\alpha_i$ is a fixed unknown quantity, not another random variable. If $\alpha_i$ is a random variable with $E(\alpha_i) = 0$ and $\text{Var}(\alpha_i) = \sigma_1^2$, then the final analysis will be different. To simplify matters, one can assume that the $\alpha_i$'s and the $e_{ij}$'s are independently normally distributed. Such models are called random effect models, the analysis of which is more complicated than the analysis of fixed effect models. Students who are interested in this area are advised to read books on the Design of Experiments and books on Response Surface Analysis.


16 Questions and answers 


In this chapter, some of the questions asked by students at the majors level (middle level courses) are presented. At McGill University, there are three levels of courses, namely, honors level for very bright students, majors level for average students in mathematical, physical and engineering sciences, and faculty programs for below average students from physical, biological and engineering sciences and for students from social sciences. Professor Mathai has taught courses at all levels for 57 years, from 1959 onward. Questions from honors level students are quite deep, and hence they are not included here. Questions from students in the faculty programs are already answered in the texts in Chapters 1 to 15 and in the comments, notes and remarks therein. Questions at the majors level that Professor Mathai could recollect and which are not directly answered in the texts in Chapters 1 to 15 are answered here, mostly for the benefit of teachers of probability and statistics at the basic level and of curious students. [These materials are taken from Module 9 of CMSS (Author: A. M. Mathai), and hence the contexts are those of Module 9.]


16.1 Questions and answers on probability and random variables 


Question. Can a given experiment be random and non-random at the same time? 


Answer. The answer is in the affirmative. It all depends upon what the experimenter
is looking for in that experiment. Take the example of throwing a stone into a pond of 
water. If the outcome that she is looking for is whether the stone sinks in water or not, 
then the experiment is not a random experiment because from physical laws we know 
that the stone sinks, and hence the outcome is pre-determined, whereas if she is look- 
ing for the region on the surface of the pond where the stone hits the water, then it be- 
comes a random experiment because the location of hit is not determined beforehand. 


Question. What, in general, are postulates or axioms?


Answer. Postulates or axioms are assumptions that you make to define something. 
These assumptions should be consistent and mutually non-overlapping. Since they 
are your own assumptions, there is no question of proving or disproving these postu- 
lates or axioms. 


Question. In some books, a coin tossed twice and two coins tossed once are taken as 
equivalent. Is this correct? 


Answer. No. If the two coins are identical in every respect and the original act of 
tossing twice and the second act of tossing once are identical in every respect, then 
the two situations can be taken as one and the same, otherwise not, usually not. 


Open Access. © 2018 Arak M. Mathai, Hans J. Haubold, published by De Gruyter. This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
https://doi.org/10.1515/9783110562545-016




Question. In the situation of a random cut of an interval, the probability that the cut 
is at a given point is zero. Does it mean that it is impossible to cut the string? The child 
has already cut the string! Is there any contradiction here? 


Answer. According to our rule here, we assigned probabilities proportional to length. 
Since a point has zero length, the assigned probability is zero. Probability of the im- 
possible event φ is zero but if the assigned probability of an event is zero this does not 
mean that the event is impossible. We can cut the string. 


Question. Does pair-wise independence imply mutual independence? For example, if $A$, $B$, $C$ are three events in $S$ and if $P(A \cap B) = P(A)P(B)$, $P(A \cap C) = P(A)P(C)$, $P(B \cap C) = P(B)P(C)$, then does this imply that $P(A \cap B \cap C) = P(A)P(B)P(C)$ also?


Question. The second doubt is: if $P(A \cap B \cap C) = P(A)P(B)P(C)$, will it not be sufficient for pair-wise independence also?


Answers. Apparently, in some books, $P(A \cap B \cap C) = P(A)P(B)P(C)$ is stated as implying that the events are pair-wise independent also. This is incorrect. In the following figures, suppose that symmetry is assumed in the sample space $S$ and that $S$ contains only a finite number of elementary events. Figure 16.1 (a) is an illustration showing that pair-wise independence does not imply mutual independence. Figure 16.1 (b) shows that if PPP (product probability property) holds for three events, then that need not imply pair-wise independence.


Figure 16.1: Pairwise and mutual independence. 


In Figure 16.1 (a), the probabilities read from the figure satisfy $P(A \cap B) = P(A)P(B)$, $P(A \cap C) = P(A)P(C)$ and $P(B \cap C) = P(B)P(C)$, and hence pair-wise independence holds, but $P(A \cap B \cap C) \ne P(A)P(B)P(C)$. Hence pair-wise independence need not imply mutual independence. In Figure 16.1 (b), $P(A \cap B \cap C) = P(A)P(B)P(C)$, but $P(B \cap C) \ne P(B)P(C)$. Hence if PPP (product probability property) holds for three events, PPP need not hold pair-wise. If there are $k$ events $A_j \subset S$, $j = 1, \ldots, k$, then for mutual independence to hold one must have PPP holding for all possible intersections of different events, that is, all intersections two at a time, all intersections three at a time, ..., and the intersection of all $k$ at a time.
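
Both failures can be reproduced on small sample spaces with equally likely elementary events. The following is a minimal sketch; the two examples (two tosses of a fair coin, and eight equally likely points) are chosen here for illustration and are not the ones drawn in Figure 16.1, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def prob(event, space):
    """Probability of an event when all points of the space are equally likely."""
    return Fraction(len(event & space), len(space))


def product_rule_holds(events, space):
    """True if P(intersection of the events) equals the product of their probabilities."""
    intersection = set(space)
    product = Fraction(1)
    for event in events:
        intersection &= event
        product *= prob(event, space)
    return prob(intersection, space) == product


def pairwise_independent(events, space):
    return all(product_rule_holds(pair, space) for pair in combinations(events, 2))


def mutually_independent(events, space):
    return all(
        product_rule_holds(subset, space)
        for size in range(2, len(events) + 1)
        for subset in combinations(events, size)
    )


# (a) Two tosses of a fair coin: pair-wise independent, not mutually independent.
space_a = {"HH", "HT", "TH", "TT"}
A = {"HH", "HT"}   # head on the first toss
B = {"HH", "TH"}   # head on the second toss
C = {"HH", "TT"}   # both tosses agree
print(pairwise_independent([A, B, C], space_a), mutually_independent([A, B, C], space_a))

# (b) Eight equally likely points: PPP holds for the triple, not for the pairs.
space_b = set(range(1, 9))
D, E, F = {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 6, 7, 8}
print(product_rule_holds([D, E, F], space_b), pairwise_independent([D, E, F], space_b))
```

In example (a) each pair intersects in the single point HH with probability 1/4, the product of two probabilities of 1/2, while the triple intersection also has probability 1/4 rather than 1/8. In example (b) the triple intersection has probability 1/8, but the intersection of D and E has probability 3/8 rather than 1/4.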


Question. Will the probabilities $P(A_1|B), P(A_2|B), \ldots, P(A_k|B)$ sum up to 1 when $A_1, \ldots, A_k$ constitute a partitioning of the sample space and $B$ is any other event in the same sample space?


Answer. Note that


\[
P(A_j|B) = \frac{P(A_j \cap B)}{P(B)}, \quad P(B) \ne 0, \quad j = 1, \ldots, k.
\]


Hence by taking the sum, for $P(B) \ne 0$,


\[
P(A_1|B) + \cdots + P(A_k|B) = \frac{1}{P(B)}\left[P(A_1 \cap B) + \cdots + P(A_k \cap B)\right] = \frac{P(B)}{P(B)} = 1
\]


since $A_1 \cap B, \ldots, A_k \cap B$ are mutually exclusive pieces of $B$ and their union is $B$ itself. Note that in the above results and procedures $k$ need not be finite. There can be a countably infinite number of events $A_1, A_2, \ldots$ in the partition and still the results will hold.


Question. Are the procedures of sampling without replacement (taking one by one 
without putting back the one already taken) and the procedure of taking one subset at 
random, one and the same in the sense that both the procedures give rise to the same 
probability statements in whatever the computations that we are going to do? 


Answer. Yes. Let us look at the above example. [A box contains 10 red and 8 green identical marbles and marbles are taken at random.] What is the probability of getting exactly 2 red and 1 green marbles? Consider the first procedure of taking one subset of 3. Then the 2 red can come from the 10 red in $\binom{10}{2}$ ways and the one green in $\binom{8}{1}$ ways. Let $D$ be the event of getting exactly 2 red and 1 green marbles. Then in the first situation of taking one sample of 3, the probability of $D$ is given by


\[
P(D) = \frac{\binom{10}{2}\binom{8}{1}}{\binom{18}{3}} = \frac{10 \times 9}{2!} \times 8 \times \frac{3!}{18 \times 17 \times 16} = 3 \times \frac{10}{18} \times \frac{9}{17} \times \frac{8}{16}.
\]

Now, let us consider the second procedure of taking one at a time without replacement. Let $A$ be the event that the first marble is red, $B$ be the event that the second marble is red and $C$ be the event that the third marble is green. Then the intersection $A \cap B \cap C$ is the event of getting the sequence RRG (red, red, green). The probability of getting this sequence is obtained by splitting the intersection with the help of the definition of conditional probability:


\[
P(A \cap B \cap C) = P(A)P(B|A)P(C|A \cap B) = \frac{10}{18} \times \frac{9}{17} \times \frac{8}{16}
\]


because for the first trial there are 18 marbles, out of which 10 are red, and the selection is done at random; hence the probability is $\frac{10}{18}$. When one red marble is removed, there are 17 marbles left, of which 9 are red. Hence, given $A$, the probability for $B$ is $P(B|A) = \frac{9}{17}$. When two red marbles are removed there are 16 marbles, out of which 8 are green. Then the probability of $C$, given that two red marbles are taken, or given $A \cap B$, is $P(C|A \cap B) = \frac{8}{16}$. Then from the multiplication rule for intersections and the conditional statements above, the probability for the sequence RRG is


\[
P(RRG) = P(A \cap B \cap C) = \frac{10}{18} \times \frac{9}{17} \times \frac{8}{16} = \frac{10 \times 9 \times 8}{18 \times 17 \times 16}.
\]


How many such sequences are there with two reds and one green? RRG, RGR, GRR, or three such sequences. For each such sequence, we see that the same probability as above appears. Hence the required probability in the sampling without replacement scheme is $3 \times \frac{10}{18} \times \frac{9}{17} \times \frac{8}{16}$. This is the same result as in the case of taking one subset of 3 from the whole set. Hence the two procedures lead to the same result. From the steps above, it is seen that this is true in general also.
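
The agreement of the two procedures can be confirmed with exact arithmetic. The following is a minimal sketch for the box of 10 red and 8 green marbles; the helper name `sequence_probability` is illustrative.

```python
from fractions import Fraction
from itertools import permutations
from math import comb

# Procedure 1: take one subset of 3 at random.
p_subset = Fraction(comb(10, 2) * comb(8, 1), comb(18, 3))


def sequence_probability(sequence, red=10, green=8):
    """Probability of an ordered colour sequence when drawing without replacement."""
    p = Fraction(1)
    for colour in sequence:
        total = red + green
        if colour == "R":
            p *= Fraction(red, total)
            red -= 1
        else:
            p *= Fraction(green, total)
            green -= 1
    return p


# Procedure 2: one at a time, summed over the distinct orderings RRG, RGR, GRR.
sequences = sorted(set(permutations("RRG")))
p_sequential = sum(sequence_probability(seq) for seq in sequences)

print(p_subset, p_sequential)
```

Both computations give 15/34, and each of the three orderings contributes the same probability, 5/34.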


Question. One doubt of the students is why we write “zero elsewhere” when writing a probability or density function. Is it necessary?


Answer. Since a real random variable is defined over the whole real line, the density should also be defined over the whole real line. Hence both the non-zero part and the zero part are to be mentioned.


Question. Another serious doubt is whether the boundary point, in the above case of a three-parameter gamma density the lower boundary point $x = \gamma$, is to be included in the non-zero part or in the zero part, since the probability at $x = \gamma$ is zero in any case. Should we write the range for the non-zero part as $\gamma \le x < \infty$ or as $\gamma < x < \infty$?


Answer. Suppose that $x = \gamma$ is not included in the non-zero part. What is the maximum likelihood estimate (MLE) of the parameter $\gamma$? It does not exist if the point $x = \gamma$ is not included in the non-zero part. It exists and is equal to the smallest order statistic if $x = \gamma$ is included in the non-zero part, that is, if the support is given as $\gamma \le x < \infty$. Since $\infty$ is not a number or a point, the question of an upper boundary point does not arise here.

As another example consider a uniform density:


\[
f(x) = \begin{cases} \dfrac{1}{b-a}, & a \le x \le b\\[4pt] 0, & \text{elsewhere.} \end{cases}
\]


If the non-zero part is written for $a < x \le b$, then the MLE for the parameter $b$ exists but the MLE for $a$ does not exist. If the support is written as $a \le x < b$, then the MLE for $a$ exists but the MLE for $b$ does not exist. If the range is written as $a < x < b$, then the MLEs for $a$ and $b$ do not exist. If the range is written as $a \le x \le b$, then the MLEs for both $a$ and $b$ exist and they are the smallest order statistic and the largest order statistic, respectively. Hence the range for the non-zero part must be written as $a \le x \le b$ even though the probabilities at $x = a$ and $x = b$ are zeros. The rule to be followed is that the non-zero part of the density or probability function has to be written including the boundary points of the support, or the interval where the non-zero part is defined.
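
The reason can be seen directly from the likelihood function. The following is a short sketch for a simple random sample $x_1, \ldots, x_n$ from the uniform density; the order-statistic notation $x_{(1)} \le \cdots \le x_{(n)}$ is introduced here for illustration.

```latex
\[
L(a,b) =
\begin{cases}
(b-a)^{-n}, & a \le x_{(1)} \ \text{and}\ x_{(n)} \le b\\
0, & \text{otherwise.}
\end{cases}
\]
% L increases as the interval [a, b] shrinks, so on the closed support
% the maximum is attained at the tightest admissible interval:
\[
\hat{a} = x_{(1)}, \qquad \hat{b} = x_{(n)}.
\]
% With the open support a < x < b the constraints become a < x_{(1)} and
% x_{(n)} < b, so L can be pushed arbitrarily close to
% (x_{(n)} - x_{(1)})^{-n} but never reaches it: the supremum is not attained.
```

The same argument applied to one end point at a time gives the mixed cases $a < x \le b$ and $a \le x < b$.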


Question. What about the condition $\alpha > 0$, where $\alpha$ is the shape parameter in a gamma density? From where is this condition coming?


Answer. The function $x^{\alpha-1}e^{-x/\beta}$ is a smooth function (no singularities, or the denominator does not become zero at any point) if $x$ is away from 0 and $\infty$. When integrating over $[0, \infty)$, as $x \to \infty$ the integral behaves like the integral of $e^{-x/\beta}$, $\beta > 0$, since the polynomial part $x^{\alpha-1}$ is dominated by the exponential part $e^{-x/\beta}$. But when $x \to \infty$, $e^{-x/\beta}$ goes to zero, which is finite, since $\beta > 0$. Hence when $x \to \infty$ there is no problem with the integral. When $x \to 0$, $e^{-x/\beta} \to 1$. Hence the difficulty may come only from the factor $x^{\alpha-1}$. In the integral, $x^{\alpha-1}$ behaves like $\frac{x^{\alpha}}{\alpha}$. Hence $\alpha \ne 0$. If $\alpha$ is negative, then $x^{\alpha}$ behaves like $\frac{1}{x^{\gamma}}$, $\gamma = -\alpha > 0$, and $\frac{1}{x^{\gamma}}$ goes to $+\infty$ when $x \to 0^{+}$. Hence $\alpha$ is not zero or negative, or $\alpha$ must be positive, when real. If $\alpha$ is a complex quantity, then the condition will be $\Re(\alpha) > 0$ where $\Re(\cdot)$ denotes the real part of $(\cdot)$.
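
The behaviour near zero can be made explicit with a short calculation. This is a sketch for real $\alpha$; the small positive numbers $\delta < \epsilon$ are introduced here only for illustration.

```latex
\[
\int_{\delta}^{\epsilon} x^{\alpha-1}\,dx =
\begin{cases}
\dfrac{\epsilon^{\alpha} - \delta^{\alpha}}{\alpha}, & \alpha \ne 0\\[6pt]
\ln \epsilon - \ln \delta, & \alpha = 0.
\end{cases}
\]
% Let delta -> 0+:
%   alpha > 0 : delta^alpha -> 0, so the integral tends to epsilon^alpha / alpha (finite);
%   alpha = 0 : -ln(delta) -> +infinity, so the integral diverges;
%   alpha < 0 : delta^alpha -> +infinity, so the integral diverges.
```

Hence the integral near zero is finite only when $\alpha > 0$.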


Question. How are the conditions $\alpha > 0$ and $\beta > 0$ coming in a beta density with the parameters $\alpha$ and $\beta$?


Answer. Take for example a type-1 beta integral. For integrating from $a$ to $b$, $a > 0$, $b < 1$, the integral is smooth; it exists and it has no difficulties, no singularities, or the denominator does not become zero at any point. There can be problems when approaching 0 or 1. When approaching zero, $1 - x$ will behave like 1 and there is no problem with the second factor in the integrand. Consider the first factor $x^{\alpha-1}$. In the integral it behaves like $\frac{x^{\alpha}}{\alpha}$ and hence $\alpha \ne 0$. By using the same argument as in the gamma integral, $\alpha$ must be positive, if real; otherwise the real part of $\alpha$ must be positive. Now, change to $y = 1 - x$ and consider the $y$-integral for type-1 beta. Now $\beta$ appears in the place of $\alpha$, and hence $\beta > 0$. Similar arguments can be put forward to show that $\alpha > 0$, $\beta > 0$ in type-2 beta integrals also. [This is left as an exercise to the students.]


Question. What exactly is $dx$, the differential of $x$? Is it a small value of $\Delta x$?


Answer. Some teachers may have told you that it is a small increment in some variable $x$. There is another notation, $\Delta x$, for a small increment in $x$. If it is a small change, then it can be negative or positive or zero.


Question. Can dx be zero or negative? 




Answer. Let $x$ be an independent real variable, independent in the sense that we will be preassigning values to $x$. Let $y = f(x)$ be a dependent variable, dependent on the preassigned values of $x$ through the function $f(x)$. Let $\Delta x$ be a small increment in $x$ and let $\Delta y$ be the corresponding increment in $y = f(x)$. That is, $\Delta y = f(x + \Delta x) - f(x)$. Then $\Delta y = 0$ if $y$ is a constant function of $x$. By convention, we take $\Delta x > 0$ always ($\Delta x$ is never negative or zero) so that we can talk about increasing and decreasing functions. For $\Delta x > 0$, if $\Delta y < 0$, then the function $y = f(x)$ is decreasing. For $\Delta x > 0$, if $\Delta y > 0$, then $y = f(x)$ is an increasing function of $x$. Hence $\Delta x > 0$ always by convention but $\Delta y$ can be negative, positive or zero. $dx$ or $dy$ by itself has no meaning. When $\Delta x$ goes to zero, it goes to zero and it does not by itself become $dx$ or something else. Small increments are denoted by $\Delta x$ and $\Delta y$ and not by $dx$ and $dy$. Then what exactly is this $dx$? Consider the ratio $\frac{\Delta y}{\Delta x}$. This ratio can always be written because we have assumed that $\Delta x \ne 0$ and $\Delta x$ is always positive. Then we can write the identity


\[
\Delta y = \left(\frac{\Delta y}{\Delta x}\right)\Delta x. \tag{i}
\]


Consider $\Delta x$ becoming smaller and smaller. If at any stage $\frac{\Delta y}{\Delta x}$ attains a limit, then let this limit be denoted by $f'(x)$. At this stage when the limit is attained, the value of $\Delta x$ is denoted by $dx$ and the corresponding $\Delta y$ is denoted by $dy$. Thus we have the identity (it is not any approximation)


\[
dy = f'(x)\,dx \ \Rightarrow\ f'(x) = dy \text{ divided by } dx \tag{ii}
\]


or $f'(x)$ is a ratio of differentials. Thus, $dx$ being the differential associated with the independent variable $x$, this $dx > 0$ by convention. The corresponding dependent variable has the differential $dy$. This $dy$ can be negative, positive or zero depending upon the nature of the function.


Question. Which one, $x$ or $y$, is to be taken as the independent variable in the function $2x + 3y - 5 = 0$? This is the same as $x = \frac{1}{2}(5 - 3y)$ or it is also the same as $y = \frac{1}{3}(5 - 2x)$. Then which is the independent variable and which is the dependent variable?


Answer. It all depends upon whether we want to calculate $x$ at preassigned values of $y$ or vice versa. If $y$ is preassigned and $x$ is calculated from there, then $y$ is the independent variable and $x$ is the dependent variable. If $x$ is preassigned and $y$ is calculated from the preassigned $x$, then $x$ is the independent variable and $y$ is the dependent variable. It will be clearer from the following case. There is a physical law $pv = c$, or pressure multiplied by volume is a constant under constant temperature. In the equation, $p$ represents pressure, $v$ volume and $c$ the constant. We can write the equation as $p = \frac{c}{v}$ or as $v = \frac{c}{p}$. Suppose we want to ask the question: what will be the pressure if the volume is 10 cubic centimeters? Then $v$ is preassigned and $p$ is calculated from the formula $p = \frac{c}{v}$. In this case, $v$ is the independent variable and $p$ is the dependent variable. If we want to calculate $v$ at preassigned values of $p$, then $p$ is the independent variable and $v$ is the dependent variable.


Question. In an implicit function $f(x_1, x_2, \ldots, x_k) = 0$, which is the dependent variable and which are the independent variables?


Answer. If our aim is to calculate $x_1$ at preassigned values of $x_2, \ldots, x_k$, then $x_2, \ldots, x_k$ are independent variables and $x_1$ is the dependent variable. Then the differentials $dx_2, \ldots, dx_k$ are strictly positive by convention, or $dx_2 > 0, \ldots, dx_k > 0$, and $dx_1$ could be negative, zero or positive according to the nature of the function $f$.


Question. What is the moment problem? 


Answer. The famous moment problem in physics and statistics is the following: We have seen that if arbitrary moments are available, then the corresponding density is uniquely determined through inverse Mellin transforms under some minor conditions. Suppose that only the integer moments are available; that is, for the single real variable case, let $E(x^h)$ for $h = 0, 1, 2, \ldots$ all be available, so that a countably infinite number of integer moments are available. Is the density $f(x)$ uniquely determined by these integer moments? This is the famous moment problem in physics and statistics. The answer is: not necessarily. There can be two different density functions corresponding to a given set of integer moments. There are several sets of sufficient conditions available so that a given integer moment sequence will uniquely determine a density/probability function. One such sufficient condition is that the non-zero part of the density is defined over a finite range $[a, b]$, $-\infty < a \le x \le b < \infty$. More sets of sufficient conditions are available; see, for example, the book [15].

This is the case of a single random variable. In the multivariate case, the corresponding problem is the following: if all integer product moments, that is, $E[x_1^{h_1} \cdots x_k^{h_k}]$ where $h_i = 0, 1, 2, \ldots$, $i = 1, \ldots, k$, are given, will these integer product moments uniquely determine the corresponding multivariate density/probability function? The answer is: not necessarily so!


Let there be a multivariate density/probability function $f(x_1, \ldots, x_k)$ and suppose we wish to compute the expected value of a function of the variables, for example, (i) $E(x_1^2)$, (ii) $E(x_1 x_2)$.


Question. How do we compute these types of expected values? In (i), only $x_1$ is involved but we have a multivariate density/probability function. We can compute $E(x_1^2)$ from the marginal density/probability function of $x_1$, but if we compute it by using the multivariate density/probability function, will the two procedures give the same result?




Answer. This is a usual doubt of the students. By using the multivariate density/probability function, consider, for example, a continuous case [the procedure is parallel in the discrete case]:


\[
E(x_1^2) = \int_X x_1^2 f(x_1, \ldots, x_k)\,dX
\]


where $\int_X = \int_{x_1=-\infty}^{\infty} \cdots \int_{x_k=-\infty}^{\infty}$ and $dX = dx_1 \wedge \cdots \wedge dx_k$. Since the function for which the expected value is to be computed contains only $x_1$, we can integrate out the other variables $x_2, \ldots, x_k$. When $x_2, \ldots, x_k$ are integrated out from the joint density, we get the marginal density of the remaining variable, namely the marginal density of $x_1$, denoted by $f_1(x_1)$. Then the integral to be evaluated reduces to


\[
E(x_1^2) = \int_{-\infty}^{\infty} x_1^2 f_1(x_1)\,dx_1.
\]


Thus both the procedures give rise to the same result. (ii) Let us take, for example, $x_3, \ldots, x_k$ to be discrete. In this case,


\[
\sum_{x_3=-\infty}^{\infty} \cdots \sum_{x_k=-\infty}^{\infty} f(x_1, \ldots, x_k) = f_{12}(x_1, x_2)
\]


where $f_{12}(x_1, x_2)$ is the marginal probability or density function of $x_1, x_2$. Now


\[
E(x_1 x_2) = \sum_{x_1=-\infty}^{\infty} \sum_{x_2=-\infty}^{\infty} x_1 x_2 f_{12}(x_1, x_2)
\]


if $x_1$ and $x_2$ are both discrete. If they are continuous, then integrate out both; if one is discrete and the other continuous, then integrate out the continuous one and sum up the discrete one. Thus, the expected value of a function of $r$, $r < k$, of the original $k$ variables can be computed either from the joint density/probability function of all the $k$ variables or from the marginal density/probability function of the $r$ variables.
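
The equality of the two routes is easy to verify numerically in the discrete case. The following is a minimal sketch with a hypothetical joint probability function of three variables; the probabilities and the helper names `expectation` and `marginal` are made up for illustration.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical joint probability function f(x1, x2, x3); the values sum to 1.
joint = {
    (0, 0, 0): Fraction(1, 8),
    (0, 1, 1): Fraction(1, 8),
    (1, 0, 1): Fraction(1, 4),
    (1, 1, 0): Fraction(1, 4),
    (2, 0, 0): Fraction(1, 8),
    (2, 1, 1): Fraction(1, 8),
}


def expectation(pmf, g):
    """E[g] computed by summing g(point) * probability over the support."""
    return sum(p * g(*point) for point, p in pmf.items())


def marginal(pmf, keep):
    """Marginal probability function of the coordinates listed in `keep`."""
    out = defaultdict(Fraction)
    for point, p in pmf.items():
        out[tuple(point[i] for i in keep)] += p
    return dict(out)


f1 = marginal(joint, [0])       # marginal of x1
f12 = marginal(joint, [0, 1])   # marginal of (x1, x2)

e_joint_i = expectation(joint, lambda x1, x2, x3: x1 ** 2)
e_marg_i = expectation(f1, lambda x1: x1 ** 2)
e_joint_ii = expectation(joint, lambda x1, x2, x3: x1 * x2)
e_marg_ii = expectation(f12, lambda x1, x2: x1 * x2)

print(e_joint_i, e_marg_i)
print(e_joint_ii, e_marg_ii)
```

For this probability function both routes give $E(x_1^2) = 3/2$ and $E(x_1 x_2) = 1/2$.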


Question. Can p (correlation between real scalar random variables x and y) measure 
relationship between x and y? 


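Answer. The answer is no; it cannot, except at the boundary points. In general, we can show that $-1 \le \rho \le 1$. This is easily proved by considering the two variables $u = \frac{x}{\sigma_x} + \frac{y}{\sigma_y}$ and $v = \frac{x}{\sigma_x} - \frac{y}{\sigma_y}$ for $\sigma_x \ne 0$, $\sigma_y \ne 0$ (non-degenerate cases). Now, take the variances of u and v and use the fact that for any real random variables u and v, $\mathrm{Var}(u) \ge 0$, $\mathrm{Var}(v) \ge 0$. But

$$\mathrm{Var}(u) = \frac{\mathrm{Var}(x)}{\sigma_x^2} + \frac{\mathrm{Var}(y)}{\sigma_y^2} + \frac{2\,\mathrm{Cov}(x,y)}{\sigma_x\sigma_y} = 1 + 1 + 2\rho = 2(1+\rho).$$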




Hence $\mathrm{Var}(u) \ge 0 \Rightarrow 2(1+\rho) \ge 0 \Rightarrow \rho \ge -1$. Similarly, $\mathrm{Var}(v) = 2(1-\rho) \ge 0 \Rightarrow \rho \le 1$. Therefore,

$$-1 \le \rho \le 1.$$

From the Cauchy-Schwarz inequality, it follows that $\rho = \pm 1$, the boundary values, if and only if $y = ax + b$, where $a \ne 0$ and $b$ are constants, almost surely. Because of this property for the boundary values, people are tempted to interpret $\rho$ as measuring linear relationship, meaning that if $\rho$ is near $+1$ or $-1$ then near linearity is present, and if $\rho$ is zero then no linearity is present, and so on. We will show that this interpretation is invalid. Some misuses go to the extent that some applied statisticians interpret positive values, $\rho > 0$, as "increasing values of x go with increasing values of y" or "decreasing values of x go with decreasing values of y", and $\rho < 0$ as "increasing values of x go with decreasing values of y" and vice versa. We will show that this interpretation is also invalid. We will show that no value of $\rho$ in the open interval $-1 < \rho < 1$ can be given any meaningful interpretation, and no interpretation can be given as measuring relationships between x and y. Take, for example, $y = a + bx + cx^2$, $c \ne 0$, and compute the correlation between x and y. For convenience, take a symmetric variable x so that all odd moments about the origin are zero. For example, take a standard normal variable. Then the correlation coefficient $\rho$ will be a function of b and c only. There are infinitely many choices for b and c. By selecting b and c appropriately, all the claims about $\rho$ can be nullified except for the case when $\rho = +1$ or $\rho = -1$. This is left as an exercise to the student.
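
As a starting point for the exercise, here is a small sketch under the stated assumption that x is standard normal, so that $E(x)=0$, $E(x^2)=1$, $E(x^3)=0$, $E(x^4)=3$. Then $\mathrm{Cov}(x,y) = b$ and $\mathrm{Var}(y) = b^2 + 2c^2$, giving $\rho = b/\sqrt{b^2+2c^2}$. The function name `rho_quadratic` and the chosen values of b and c are illustrative only.

```python
import math

def rho_quadratic(b, c):
    """Correlation between x ~ N(0,1) and y = a + b*x + c*x**2.

    Uses Cov(x, y) = b and Var(y) = b**2 + 2*c**2; the constant a drops out.
    Requires (b, c) != (0, 0).
    """
    return b / math.sqrt(b * b + 2.0 * c * c)

# Illustrative choices of (b, c); y is an exact function of x in every case.
for b, c in [(1.0, 0.0), (1.0, 1.0), (1.0, 10.0), (0.0, 1.0), (-1.0, 0.1)]:
    print(b, c, round(rho_quadratic(b, c), 4))
```

With b = 0 the correlation is exactly zero although y is completely determined by x, and for fixed b the value of $\rho$ can be pushed as close to zero as desired by increasing |c|.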


Question. Do the mgf (moment generating function) of type-1 beta density and the 
corresponding type-1 Dirichlet density exist, because these are not seen in books? 


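Answer. For the type-1 beta and type-1 Dirichlet densities, the moment generating function (mgf) exists. For the type-2 beta and type-2 Dirichlet densities, the mgf does not exist, but the characteristic functions, $E(e^{itx})$, $i = \sqrt{-1}$, exist. For the type-1 beta, the mgf, denoted by $M_x(t)$ where t is an arbitrary parameter, is given by

$$M_x(t) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\int_0^1 e^{tx}x^{\alpha-1}(1-x)^{\beta-1}\,dx.$$

Here, $e^{tx}$ can be expanded since term by term integration is valid: the resulting series is uniformly convergent and the integral is also convergent.

$$M_x(t) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\sum_{k=0}^{\infty}\frac{t^k}{k!}\int_0^1 x^{\alpha+k-1}(1-x)^{\beta-1}\,dx$$

$$= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\sum_{k=0}^{\infty}\frac{t^k}{k!}\frac{\Gamma(\alpha+k)\Gamma(\beta)}{\Gamma(\alpha+\beta+k)},\quad \Re(\alpha+k) > 0,\ k = 0,1,2,\ldots$$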



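But

$$\Gamma(\alpha+k) = (\alpha)_k\Gamma(\alpha),\quad \Gamma(\alpha+\beta+k) = \Gamma(\alpha+\beta)(\alpha+\beta)_k$$

where, for example, $(a)_k$ is the Pochhammer symbol, given by

$$(a)_k = a(a+1)\cdots(a+k-1),\quad a \ne 0,\quad (a)_0 = 1.$$

Hence

$$M_x(t) = \sum_{k=0}^{\infty}\frac{(\alpha)_k}{(\alpha+\beta)_k}\frac{t^k}{k!} = {}_1F_1(\alpha;\alpha+\beta;t)$$

where ${}_1F_1$ is the confluent hypergeometric series which is convergent for all finite t.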
Since it is a hypergeometric series, many books avoid the discussion of the mgf here. 
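By using the same procedure, one can evaluate the mgf in the type-1 Dirichlet case, namely,

$$M_{x_1,\ldots,x_k}(t_1,\ldots,t_k) = E[e^{t_1x_1+\cdots+t_kx_k}]$$

$$= \frac{\Gamma(\alpha_1+\cdots+\alpha_{k+1})}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_{k+1})}\int_{\Omega} e^{t_1x_1+\cdots+t_kx_k}x_1^{\alpha_1-1}\cdots x_k^{\alpha_k-1}(1-x_1-\cdots-x_k)^{\alpha_{k+1}-1}\,dx_1\wedge\cdots\wedge dx_k$$

where $\Omega$ is the support of the type-1 Dirichlet density, $x_j > 0$, $j=1,\ldots,k$, $x_1+\cdots+x_k < 1$. When we expand the exponential part, we can write

$$e^{t_1x_1+\cdots+t_kx_k} = \sum_{r_1=0}^{\infty}\cdots\sum_{r_k=0}^{\infty}\frac{t_1^{r_1}\cdots t_k^{r_k}}{r_1!\cdots r_k!}x_1^{r_1}\cdots x_k^{r_k}.$$

Then, including the normalizing constant,

$$\frac{\Gamma(\alpha_1+\cdots+\alpha_{k+1})}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_{k+1})}\int_{\Omega} x_1^{\alpha_1+r_1-1}\cdots x_k^{\alpha_k+r_k-1}(1-x_1-\cdots-x_k)^{\alpha_{k+1}-1}\,dx_1\wedge\cdots\wedge dx_k$$

$$= \frac{\Gamma(\alpha_1+r_1)}{\Gamma(\alpha_1)}\cdots\frac{\Gamma(\alpha_k+r_k)}{\Gamma(\alpha_k)}\times\frac{\Gamma(\alpha_1+\cdots+\alpha_{k+1})}{\Gamma(\alpha_1+\cdots+\alpha_{k+1}+r_1+\cdots+r_k)}$$

for $\Re(\alpha_j+r_j) > 0$, $r_j = 0,1,2,\ldots$; $j=1,\ldots,k$. Now, the uniformly convergent multiple series above will sum up to a Lauricella function of the type $F_D$; details may be seen from [3].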

Let us see what happens in the type-2 beta case if we expand the exponential part and try to integrate term by term. Then the integral to be evaluated is the following:

$$\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\int_0^{\infty}x^{\alpha+k-1}(1+x)^{-(\alpha+\beta)}\,dx = \frac{\Gamma(\alpha+k)\Gamma(\beta-k)}{\Gamma(\alpha)\Gamma(\beta)},\quad -\Re(\alpha) < k < \Re(\beta).$$

Note that $k = 0,1,2,\ldots$ but $-\Re(\alpha)$ and $\Re(\beta)$ are fixed quantities, and hence the condition for existence will be violated from some stage onwards. This means that the integral is not convergent, or that expansion of the exponential part and term by term integration is not a valid procedure here. The moment generating function does not exist for the type-2 beta and type-2 Dirichlet cases.
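
The type-1 beta result, $M_x(t) = {}_1F_1(\alpha;\alpha+\beta;t)$, can be checked numerically. The sketch below is only a sanity check under illustrative assumptions ($\alpha = 2$, $\beta = 3$, $t = 1.5$, a truncated series, and a simple midpoint rule for the integral); the function names are hypothetical.

```python
import math

def mgf_beta1_series(alpha, beta, t, terms=80):
    """Partial sum of sum_k (alpha)_k / (alpha+beta)_k * t**k / k!."""
    total, term = 0.0, 1.0  # term holds the k-th summand, starting at k = 0
    for k in range(terms):
        total += term
        # Ratio of consecutive summands: (alpha+k)/(alpha+beta+k) * t/(k+1).
        term *= (alpha + k) / (alpha + beta + k) * t / (k + 1)
    return total

def mgf_beta1_integral(alpha, beta, t, n=20000):
    """Midpoint-rule value of E[exp(t*x)] under the type-1 beta density."""
    const = math.gamma(alpha + beta) / (math.gamma(alpha) * math.gamma(beta))
    h = 1.0 / n
    s = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        s += math.exp(t * x) * x ** (alpha - 1) * (1 - x) ** (beta - 1)
    return const * s * h

series_value = mgf_beta1_series(2.0, 3.0, 1.5)
integral_value = mgf_beta1_integral(2.0, 3.0, 1.5)
print(series_value, integral_value)
```

The two values agree to within the accuracy of the midpoint rule. No analogous check is possible in the type-2 case, because the k-th moment there exists only for $k < \Re(\beta)$.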

Students have difficulty in evaluating the density f(x) from arbitrary moments by using inverse Mellin transforms. If $\phi(s)$ is $E(x^{s-1})$ for a positive continuous real random variable x with density f(x), and if $\phi(s)$ exists for a complex s and $\phi(s)$ is analytic in a strip in the complex plane, then the inverse Mellin transform is given by

$$f(x) = \frac{1}{2\pi i}\int_{c-i\infty}^{c+i\infty}\phi(s)x^{-s}\,ds,\quad i = \sqrt{-1}.$$

Detailed conditions for the existence of Mellin and inverse Mellin transforms may be seen from [2]. We will work out the inverse Mellin transform for a known special case here so that the procedure will be clear to the students.


Example 16.1. If $\Gamma(s)$ is the Mellin transform of some function f(x) for x > 0, then evaluate f(x) by using the formula for the inverse Mellin transform.


Solution 16.1. From the integral representation of the gamma function, we know that

$$\Gamma(s) = \int_0^{\infty}x^{s-1}e^{-x}\,dx,\quad \Re(s) > 0.$$ (a)

Thus we know the function f(x) to be $e^{-x}$ from this representation. But we want to recover f(x) from the inverse Mellin transform. The function to be recovered is

$$f(x) = \frac{1}{2\pi i}\int_{c-i\infty}^{c+i\infty}\Gamma(s)x^{-s}\,ds.$$ (b)

Since $\Gamma(s)$ has poles at $s = 0,-1,-2,\ldots$, the infinite semicircle closing the line from $c - i\infty$ to $c + i\infty$ on the left can enclose all these poles if we take any c > 0, for example, c = 0.5 or 3.8. From residue calculus, f(x) in (b) is available as the sum of the residues at the poles of the integrand in (b). The residue at $s = -\nu$, denoted by $R_\nu$, is given by

$$R_\nu = \lim_{s\to-\nu}(s+\nu)\Gamma(s)x^{-s}.$$

We cannot substitute $s = -\nu$ and evaluate the limit directly. We will introduce a few more factors in the numerator and denominator so that the expression becomes simpler.

$$R_\nu = \lim_{s\to-\nu}\frac{(s+\nu)(s+\nu-1)\cdots s}{(s+\nu-1)\cdots s}\Gamma(s)x^{-s} = \lim_{s\to-\nu}\frac{\Gamma(s+\nu+1)}{(s+\nu-1)\cdots s}x^{-s} = \frac{\Gamma(1)\,x^{\nu}}{(-1)(-2)\cdots(-\nu)} = \frac{(-1)^{\nu}x^{\nu}}{\nu!}.$$




Then

$$\sum_{\nu=0}^{\infty}R_\nu = \sum_{\nu=0}^{\infty}\frac{(-1)^{\nu}x^{\nu}}{\nu!} = e^{-x} = f(x).$$

Thus f(x) is recovered through the inverse Mellin transform.
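
The residue sum in Example 16.1 can be confirmed numerically. This is a minimal sketch, assuming the series is truncated after a fixed number of terms; the helper names are illustrative.

```python
import math

def residue(nu, x):
    """Residue of Gamma(s) * x**(-s) at s = -nu: (-1)**nu * x**nu / nu!."""
    return (-x) ** nu / math.factorial(nu)

def f_from_residues(x, terms=60):
    """Truncated sum of the residues at s = 0, -1, ..., -(terms - 1)."""
    return sum(residue(nu, x) for nu in range(terms))

for x in (0.5, 1.0, 3.0):
    print(x, f_from_residues(x), math.exp(-x))
```

For each x the truncated residue sum and `math.exp(-x)` agree to floating-point accuracy.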


Question. Are there other multivariate generalizations of type-1 and type-2 beta den- 
sities, other than type-1 and type-2 Dirichlet densities? 


Answer. Yes, there are other multivariate models available in the literature, for exam- 
ple, see [10]. Now the question comes: for a given density, such as exponential density 
or normal density, is there anything called the unique multivariate analogue? Students 
are used to the phrase “the multivariate normal density". Is it a unique density corre- 
sponding to the univariate normal density? 


Question. In general, given a univariate density/probability function, is there any- 
thing called the unique multivariate analogue? 


Answer. There is no unique analogue. There can be different types of bivariate or 
multivariate densities where the marginal densities are the given densities. This is ev- 
idently obvious but some students have the feeling that the multivariate models are 
unique. 


Question. Is a “multivariable or multivariate" distribution the same as “vector- 
variate" distribution? 


Answer. Some authors treat the multivariate case and the vector variable case as one and the same. This is not so. There is a clear distinction between "multivariate" and "vector-variate" cases. In the multivariate case, there is no order in which the variables enter into the model, or the variables could be interchanged in the model with the corresponding changes in the parameters, if any. A vector variable case is a multivariate case where, in addition, the order in which the elements appear also enters into the model. If we have a matrix-variate case and if we are looking at the marginal joint density of a particular row of the matrix, say, for example, the first row of the matrix, then we have a vector variable case. If we have a function $f(x_1,x_2) = c_1(x_1+x_2)$, $0 \le x_1 \le 1$, $0 \le x_2 \le 1$ and zero elsewhere, where $c_1$ is the normalizing constant, then we can take the variables as $(x_1,x_2)$ or as $(x_2,x_1)$ and both will produce the same function in the same square. Suppose that our function is $c_2(x_1+x_2)$ but defined in the triangle $0 \le x_1 \le x_2 \le 1$; then $(x_1,x_2)$ is different from $(x_2,x_1)$. They cannot be freely interchanged. The ordered set $(x_1,x_2)$ is different from the ordered set $(x_2,x_1)$. Students must make a distinction between the "multivariate case" and the "vector variable case". In the latter case, the order in which the variables appear also enters into the model.




Question. How does Pearson's $X^2$ statistic come from a multinomial probability law?


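Answer. The mgf in the multinomial case is the following:

$$E[e^{t_1x_1+\cdots+t_{k-1}x_{k-1}}] = \sum_{x_1+\cdots+x_k=n} e^{t_1x_1+\cdots+t_{k-1}x_{k-1}}\frac{n!}{x_1!\cdots x_k!}p_1^{x_1}\cdots p_k^{x_k}$$

$$= \sum_{x_1+\cdots+x_k=n}\frac{n!}{x_1!\cdots x_k!}(p_1e^{t_1})^{x_1}\cdots(p_{k-1}e^{t_{k-1}})^{x_{k-1}}p_k^{x_k}$$

$$= (p_1e^{t_1}+\cdots+p_{k-1}e^{t_{k-1}}+p_k)^n = M(t_1,\ldots,t_{k-1})$$ (i)

from a multinomial expansion. We can differentiate this multivariate mgf to obtain integer moments. For example, denoting $T = O \Rightarrow t_1 = 0,\ldots,t_{k-1} = 0$,

$$\frac{\partial M}{\partial t_j}\Big|_{T=O} = E(x_j),\ j=1,\ldots,k-1\quad\text{and}\quad E(x_k) = n - E(x_1) - \cdots - E(x_{k-1}).$$

$$\frac{\partial M}{\partial t_j}\Big|_{T=O} = np_je^{t_j}(p_1e^{t_1}+\cdots+p_{k-1}e^{t_{k-1}}+p_k)^{n-1}\Big|_{T=O} = np_j$$

for $j=1,2,\ldots,k-1$ and $E(x_k) = n - np_1 - \cdots - np_{k-1} = n(1-p_1-\cdots-p_{k-1}) = np_k$.

$$\frac{\partial^2 M}{\partial t_i\partial t_j}\Big|_{T=O} = n(n-1)p_ip_j = E(x_ix_j),\quad i\ne j = 1,\ldots,k-1.$$

$$\frac{\partial^2 M}{\partial t_j^2}\Big|_{T=O} = n(n-1)p_j^2 + np_j = E(x_j^2).$$

Hence

$$\mathrm{Var}(x_j) = E(x_j^2) - [E(x_j)]^2 = n(n-1)p_j^2 + np_j - n^2p_j^2 = np_j(1-p_j),$$

for $j=1,\ldots,k-1$.

$$\mathrm{Cov}(x_i,x_j) = E(x_ix_j) - E(x_i)E(x_j) = n(n-1)p_ip_j - (np_i)(np_j) = -np_ip_j,$$

for $i\ne j = 1,\ldots,k-1$.

$$\mathrm{Cov}(x_i,x_k) = \mathrm{Cov}(x_i, n - x_1 - \cdots - x_{k-1})$$

$$= -\mathrm{Cov}(x_i,x_1) - \cdots - \mathrm{Cov}(x_i,x_{k-1})$$

$$= np_i(p_1+\cdots+p_{i-1}+p_{i+1}+\cdots+p_{k-1}) - np_i(1-p_i)$$

$$= np_i(1-p_k-1) = -np_ip_k.$$

$$\mathrm{Var}(x_k) = \mathrm{Var}(n - x_1 - \cdots - x_{k-1}) = \mathrm{Var}(x_1+\cdots+x_{k-1})$$

$$= \sum_{i=1}^{k-1}\mathrm{Var}(x_i) + 2\sum_{i<j=1}^{k-1}\mathrm{Cov}(x_i,x_j)$$

$$= np_1(1-p_1)+\cdots+np_{k-1}(1-p_{k-1}) - 2\sum_{i<j=1}^{k-1}np_ip_j.$$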




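But

$$-2\sum_{i<j=1}^{k-1}np_ip_j = -np_1(1-p_1-p_k) - \cdots - np_{k-1}(1-p_{k-1}-p_k)$$

$$= -n(1-p_k) + n(p_1^2+\cdots+p_{k-1}^2) + np_k(1-p_k).$$

Hence

$$\mathrm{Var}(x_k) = n(p_1+\cdots+p_{k-1}) - n(p_1^2+\cdots+p_{k-1}^2) - n(1-p_k) + n(p_1^2+\cdots+p_{k-1}^2) + np_k(1-p_k) = np_k(1-p_k).$$

Hence

$$\mathrm{Var}(x_i) = np_i(1-p_i),\quad i=1,2,\ldots,k$$

and

$$\mathrm{Cov}(x_i,x_j) = -np_ip_j,\quad i\ne j = 1,2,\ldots,k.$$ (ii)

The $k\times k$ matrix of variances and covariances is given by

$$V_1 = \begin{bmatrix} np_1(1-p_1) & -np_1p_2 & \cdots & -np_1p_k\\ -np_2p_1 & np_2(1-p_2) & \cdots & -np_2p_k\\ \vdots & \vdots & \ddots & \vdots\\ -np_kp_1 & -np_kp_2 & \cdots & np_k(1-p_k)\end{bmatrix}.$$

This $V_1$ is a singular matrix, that is, its determinant is zero, $|V_1| = 0$. The non-singular covariance matrix in the multinomial case is given by

$$V = \begin{bmatrix} np_1(1-p_1) & -np_1p_2 & \cdots & -np_1p_{k-1}\\ -np_2p_1 & np_2(1-p_2) & \cdots & -np_2p_{k-1}\\ \vdots & \vdots & \ddots & \vdots\\ -np_{k-1}p_1 & -np_{k-1}p_2 & \cdots & np_{k-1}(1-p_{k-1})\end{bmatrix}.$$

The most important quantity associated with $V$ and $V_1$ is Pearson's "goodness-of-fit" statistic $X^2$. The statistic, denoted by $X^2$, is given by

$$X^2 = \sum_{i=1}^{k}\frac{(n_i-np_i)^2}{np_i}$$ (iii)

where $n_i$ is the observed frequency in the i-th cell. This can be shown to be the square of a generalized distance between the observed vector O and the expected vector E, where

$$O = \begin{bmatrix} n_1\\ \vdots\\ n_{k-1}\end{bmatrix},\quad E = n\begin{bmatrix} p_1\\ \vdots\\ p_{k-1}\end{bmatrix}.$$ (iv)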




The ordinary Euclidean distance between O and E is $[(O-E)'(O-E)]^{1/2}$. A generalized distance between O and E is available by scaling with the inverse of the square root of the covariance matrix V. That is the Euclidean distance between $V^{-1/2}O$ and $V^{-1/2}E$. That is,

$$[(V^{-1/2}O - V^{-1/2}E)'(V^{-1/2}O - V^{-1/2}E)]^{1/2} = [(O-E)'V^{-1}(O-E)]^{1/2}.$$ (v)

Hence the square of this generalized distance is $(O-E)'V^{-1}(O-E)$. This quantity should be approximately a chi-square with k − 1 degrees of freedom according to the central limit theorem; see the note below, which explains why this is chi-square distributed. Note that we can write V as follows:

$$V = nD[I - JJ'D] = nD[D^{-1} - JJ']D,\quad D = \begin{bmatrix} p_1 & 0 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & p_{k-1}\end{bmatrix},\quad J = \begin{bmatrix}1\\ \vdots\\ 1\end{bmatrix}.$$ (vi)

It can be shown that

$$V^{-1} = \frac{1}{n}\Big[D^{-1} + \frac{1}{p_k}JJ'\Big]$$ (vii)

by inverting $D^{-1} - JJ'$ with the help of elementary transformations. Let $(O-E)' = (n_1-np_1,\ldots,n_{k-1}-np_{k-1})$. Then

$$(O-E)'V^{-1}(O-E) = (O-E)'\Big(\frac{1}{n}D^{-1}\Big)(O-E) + (O-E)'\Big(\frac{1}{np_k}JJ'\Big)(O-E)$$

where

$$(O-E)'\Big(\frac{1}{n}D^{-1}\Big)(O-E) = \sum_{j=1}^{k-1}\frac{(n_j-np_j)^2}{np_j}$$

and, since $\sum_{j=1}^{k-1}(n_j-np_j) = -(n_k-np_k)$,

$$(O-E)'\Big(\frac{1}{np_k}JJ'\Big)(O-E) = \frac{[\sum_{j=1}^{k-1}(n_j-np_j)]^2}{np_k} = \frac{(n_k-np_k)^2}{np_k}.$$

Therefore, adding the above two terms we have

$$(O-E)'V^{-1}(O-E) = \sum_{j=1}^{k}\frac{(n_j-np_j)^2}{np_j} = X^2\ \text{(Pearson's } X^2\text{)} \to \chi^2_{k-1}.$$ (viii)
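
The identity between the generalized distance and Pearson's $X^2$ can be verified on a small numerical case. The sketch below assumes illustrative counts and cell probabilities (not data from the text) and uses the closed form of $V^{-1}$, namely $\frac{1}{n}[D^{-1} + \frac{1}{p_k}JJ']$; all function names are hypothetical.

```python
def pearson_x2(counts, probs):
    """Pearson's X^2: sum over all k cells of (n_i - n p_i)^2 / (n p_i)."""
    n = sum(counts)
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, probs))

def cov_matrix(n, probs):
    """(k-1) x (k-1) covariance matrix V of the first k-1 multinomial counts."""
    q = probs[:-1]
    m = len(q)
    return [[n * q[i] * (1 - q[i]) if i == j else -n * q[i] * q[j]
             for j in range(m)] for i in range(m)]

def cov_inverse(n, probs):
    """Closed form V^{-1} = (1/n) [D^{-1} + (1/p_k) J J']."""
    q, pk = probs[:-1], probs[-1]
    m = len(q)
    return [[((1 / q[i] if i == j else 0.0) + 1 / pk) / n
             for j in range(m)] for i in range(m)]

def quadratic_form(counts, probs):
    """(O - E)' V^{-1} (O - E), built from the first k-1 cells only."""
    n = sum(counts)
    d = [c - n * p for c, p in zip(counts[:-1], probs[:-1])]
    vinv = cov_inverse(n, probs)
    m = len(d)
    return sum(d[i] * vinv[i][j] * d[j] for i in range(m) for j in range(m))

counts = [18, 30, 27, 25]        # illustrative observed frequencies, n = 100
probs = [0.2, 0.3, 0.25, 0.25]   # illustrative cell probabilities
x2 = pearson_x2(counts, probs)
qf = quadratic_form(counts, probs)
print(x2, qf)
```

Although the quadratic form uses only the first k − 1 cells, it reproduces the full k-term Pearson sum, because the last cell is determined by the others.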


16.2 Questions and answers on model building 


Question. Is regression the same as least square analysis? 




Answer. Definitely not. In a large number of books, regression is interpreted as least square estimation, and such books often start by defining regression as least square estimation in model building. Least square estimation is a procedure of model building, introduced by Gauss. The basic principle there is to minimize the square of the Euclidean distance between the observed value and the value estimated by the model, and then fit the model in hand to the data at hand. Suppose someone wishes to fit a second degree polynomial $y = a + bx + cx^2$, $c \ne 0$, to paired observations $(x_1,y_1),(x_2,y_2),\ldots,(x_n,y_n)$. Then, corresponding to $x_j$, an observation on x, there is an observed value of y denoted by $y_j$. Naturally, $y_j$ need not be equal to $a + bx_j + cx_j^2$ unless $y = a + bx + cx^2$ is a mathematical relationship or the model selected is a perfect fit for all possible pairs of observations $(x_j,y_j)$. In reality, the selected model $y = a + bx + cx^2$ is taken as a possible behavior of the data in hand. Hence, naturally, there is an error which may be taken as $e_j = y_j - (a + bx_j + cx_j^2)$, the difference between the observed and modeled value of y.

One question the students ask is: can we take the error as $(a + bx_j + cx_j^2) - y_j = e_j$ instead of taking it the other way around? The answer is "yes" because we are going to consider only the distance, and hence both will lead to the same answers at the end.

Here, the unknown quantities are a, b, c because an arbitrary second degree polynomial is selected to fit the data in hand. The error sum of squares is then

$$\sum_{j=1}^{n}e_j^2 = \sum_{j=1}^{n}(y_j - a - bx_j - cx_j^2)^2.$$

This is the squared Euclidean distance between the observed $y_j$ and the modeled $a + bx_j + cx_j^2$. If a, b, c are estimated by minimizing the error sum of squares, then the method is the method of least squares, that is, the sum of squares of the errors is made a minimum. Minimization can be done in many ways. If calculus is used, then consider the equations:

$$\frac{\partial}{\partial a}\Big[\sum_{j=1}^{n}e_j^2\Big] = 0 \Rightarrow \sum_{j=1}^{n}(y_j - \hat{a} - \hat{b}x_j - \hat{c}x_j^2) = 0$$ (i)

$$\frac{\partial}{\partial b}\Big[\sum_{j=1}^{n}e_j^2\Big] = 0 \Rightarrow \sum_{j=1}^{n}(x_jy_j - \hat{a}x_j - \hat{b}x_j^2 - \hat{c}x_j^3) = 0$$ (ii)

$$\frac{\partial}{\partial c}\Big[\sum_{j=1}^{n}e_j^2\Big] = 0 \Rightarrow \sum_{j=1}^{n}(x_j^2y_j - \hat{a}x_j^2 - \hat{b}x_j^3 - \hat{c}x_j^4) = 0.$$ (iii)

Note that (i), (ii), (iii) do not hold for all possible values of a, b, c; they hold only at a point or points $(\hat{a},\hat{b},\hat{c})$ satisfying all the equations (i), (ii), (iii). A solution of (i), (ii), (iii) for (a, b, c) is denoted as $(\hat{a},\hat{b},\hat{c})$. Note that in (i), (ii), (iii) all $x_j$'s and $y_j$'s are data or known numbers and the only unknown quantities are $\hat{a},\hat{b},\hat{c}$.
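
The three normal equations for the quadratic model form a linear system in $(\hat{a},\hat{b},\hat{c})$, with coefficients given by sums of powers of the $x_j$. A minimal sketch of solving it is given below, assuming noise-free illustrative data generated from $y = 1 + 2x + 3x^2$ and a small hand-written Gaussian elimination; the helper names `solve_linear` and `fit_quadratic` are hypothetical.

```python
def solve_linear(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z

def fit_quadratic(xs, ys):
    """Return (a_hat, b_hat, c_hat) solving the normal equations (i)-(iii)."""
    s = [sum(x ** p for x in xs) for p in range(5)]                  # sums of x^0..x^4
    t = [sum(y * x ** p for x, y in zip(xs, ys)) for p in range(3)]  # sums of y*x^p
    a = [[s[0], s[1], s[2]],
         [s[1], s[2], s[3]],
         [s[2], s[3], s[4]]]
    return tuple(solve_linear(a, t))

xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1 + 2 * x + 3 * x * x for x in xs]  # noise-free illustrative data
a_hat, b_hat, c_hat = fit_quadratic(xs, ys)
print(a_hat, b_hat, c_hat)
```

Because the illustrative data lie exactly on the parabola, the solution recovers the generating coefficients up to rounding; with noisy data the same system gives the least square estimates.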


Question. Should we have used a, b, c in equations (i), (ii), (iii), usually that is done 
in many books, rather than putting hat for a, b, c in these equations? 




Answer. If we write the equations (i), (ii), (iii) without hat, then it means that the 
equations hold for all possible values of a, b, c, which is incorrect, and hence a proper 
way of writing the equations is with the hat for the unknown quantities a, b,c to indi- 
cate specific values for which the equations hold. 

A solution of (i), (ii), (iii) is called a critical point and the equations (i), (ii), (iii) are 
called normal equations. 


Question. Why are they called normal equations, is there any normal distribution 
involved? 


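Answer. As far as this author knows, there is no normality involved; someone called the minimizing equations in the least square procedure normal equations, and from then on they have been called normal equations. A common explanation is that "normal" is used here in the geometric sense of "perpendicular": for a model linear in the parameters, the equations say that the vector of residuals is orthogonal to the columns of the observation matrix. A point corresponding to a local maximum, local minimum, or saddle point is called a critical point when a calculus procedure is used.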

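Consider a general situation where one has a model $y = g(x_1,\ldots,x_p,a_1,\ldots,a_k)$ to be fitted to given data. Here y and $(x_1,\ldots,x_p)$ are going to be observed, and hence they will be numbers eventually, and the only parameters or unknown quantities will be $a_1,\ldots,a_k$. In our example, $g(x,a,b,c) = a + bx + cx^2$, that is, $y = a + bx + cx^2$. In the general case, with $e_i = y_i - g(x_{1i},\ldots,x_{pi},a_1,\ldots,a_k)$ denoting the error at the i-th observation, the normal equations are the following:

$$\frac{\partial}{\partial a_j}\Big[\sum_{i=1}^{n}e_i^2\Big] = 0 \Rightarrow \sum_{i=1}^{n}\big[y_i - g(x_{1i},\ldots,x_{pi},\hat{a}_1,\ldots,\hat{a}_k)\big]\frac{\partial g}{\partial a_j}\Big|_{a=\hat{a}} = 0,\quad j=1,\ldots,k.$$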


If g is a linear function of the unknowns $a_1,\ldots,a_k$, then the procedure is called the linear least square procedure. Note that $y, x_1,\ldots,x_p$ are going to be observed, and hence they will be numbers eventually. Hence, whether g is a linear or non-linear function of $x_1,\ldots,x_p$, it is still a linear least square problem as long as g is linear in $a_1,\ldots,a_k$. In our model, the function $g = a + bx + cx^2$ is linear in a, b, c but non-linear in x, and the model is a linear model because it is linear in the unknowns a, b, c. Students ask several questions in this connection.


Question. In least square analysis, usually we compute the critical points from the 
normal equation and construct the model. Why are we not computing the second- 
order derivatives and the matrix of second-order derivatives evaluated at a critical 
point to check for maxima/minima and instead of doing this why do we stop at critical 
points and claim that we have minimized the sum of squares? 


Answer. Since the parameters $a_1,\ldots,a_k$ are arbitrary, we can set them arbitrarily large. For example, in our example we can set a, b, c arbitrarily large and, therefore, naturally, the maximum of $\sum_{j=1}^{n}e_j^2$ is at $+\infty$, and hence the critical point, usually unique, will correspond to a minimum.




Question. In least square and other model building situations, why are $x_1,\ldots,x_p$ called independent variables and y the dependent variable? What are they independent of?


Answer. "Independent" is a somewhat unfortunate term. What it means is that we assign values to $x_1,\ldots,x_p$, or that at given values of $x_1,\ldots,x_p$ we wish to estimate y. In this sense, $x_1,\ldots,x_p$ are called independent variables and y the dependent variable. In $y - 2x - 3 = 0$, which one is the independent variable and which is the dependent variable? We can write this equation as $y = 2x + 3$ or as $x = \frac{1}{2}(y - 3)$. Which variables are independent and which are dependent does not depend on how we write the equation. It depends upon how we are going to use the equation. If we are going to evaluate y at a given value of x, then x is the independent variable and y is the dependent variable; if we are going to evaluate x at a given y, then y is the independent variable and x is the dependent variable. In the general model, $y = g(x_1,\ldots,x_p,a_1,\ldots,a_k)$, we are going to preassign values to $x_1,\ldots,x_p$ and observe the corresponding value of y. Hence $x_1,\ldots,x_p$ will be the independent variables here and y is the only dependent variable.


Question. In almost all books on model building, a simple example is given claiming to show how to convert a non-linear model into a linear model: take the model $y = ab^x$, then take logarithms and write $Y = A + Bx$ where $Y = \ln y$, $A = \ln a$, $B = \ln b$. Can we convert non-linear models to linear models by such a procedure?


Answer. Unfortunately, the above procedure is incorrect. For taking logarithms, the $y_j$'s must be positive since we are dealing with real numbers. Suppose that the $y_j$'s are positive and a, b are assumed to be positive. Still the procedure is incorrect. What happens to the error? In this case, $y_j = ab^{x_j} + e_j$, $j = 1,\ldots,n$. We cannot write the logarithm of the sum on the right side as a sum of logarithms. Suppose that the error enters into the model as a product, such as $y_j = ab^{x_j}e_j$; then can we take logarithms and proceed? The meaning of error is that it can be a positive or negative quantity. If one can guarantee that in certain problems the error is always a positive number, the error enters into the model as a product, and the model is of the form $ab^x$, then one can take logarithms. This is not a usual practical situation. The natural thing to do in a model such as $y = ab^x$ is to apply non-linear least square analysis.


Question. Then what is regression, if not least square analysis? 


Answer. Regression is a prediction problem and it is not an estimation problem. We are trying to predict one variable, say y, by using other variables such as x or $x_1,\ldots,x_p$, in the sense of asking: what is the predicted value of y at preassigned values of $x_1,\ldots,x_p$? What is the "best" predictor function of x, "best" in some sense? If there is only one variable x to be used for prediction, we can use any arbitrary function $\phi(x)$ as a predictor function. But the predicted value $\phi(x)$ for y and the true value of y may be far apart. We would like to have $\phi(x)$ agreeing with y whatever be the value of y.




This is not possible in a real-life situation unless there is a physical law behind it so that there is a mathematical relationship between x and y. In the absence of a mathematical relationship, the best thing to do is to minimize a distance between y and $\phi(x)$ and select a $\phi$. Then this $\phi$, which minimizes a "distance" between y and $\phi(x)$, can be called the "best predictor". We can construct various measures of distance between y and an arbitrary predictor function $\phi(x)$. A convenient squared distance is $E|y - \phi(x)|^2$ where E denotes the expected value. At given x, $\phi(x)$ will be a constant, say a. Then the question is: what is a such that $E|y - a|^2$ is a minimum? We already know the answer to this. It is the expected value of y at that preassigned value of x, that is, $E(y|x)$, the conditional expectation of y at the preassigned value of x. Hence this conditional expectation is defined as the "best" predictor, "best" in the minimum mean square sense, that is, the expected value of the squared Euclidean distance is minimized.
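
The minimizing property follows from $E|y-a|^2 = \mathrm{Var}(y) + (a - E(y))^2$, evaluated in the conditional distribution of y at the given x. The sketch below illustrates it numerically, assuming a hypothetical discrete conditional distribution chosen only for illustration.

```python
# Hypothetical conditional distribution of y at one fixed value of x.
values = [0.0, 1.0, 4.0, 9.0]
probs = [0.1, 0.4, 0.3, 0.2]

def mean_squared_error(a):
    """E|y - a|^2 under the conditional distribution above."""
    return sum(p * (y - a) ** 2 for y, p in zip(values, probs))

cond_mean = sum(p * y for y, p in zip(values, probs))  # E(y|x) for this x
candidates = [cond_mean + d for d in (-2.0, -0.5, -0.01, 0.01, 0.5, 2.0)]
print(cond_mean, mean_squared_error(cond_mean))
print([round(mean_squared_error(a), 4) for a in candidates])
```

Every candidate other than the conditional mean gives a larger mean squared error, by exactly the square of its distance from the conditional mean.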


Definition (Regression of y on x). It is defined as $E(y|x)$, the conditional expectation of y at a preassigned value of x, if only one variable x is used, and it is $E(y|x_1,\ldots,x_p)$ if many variables $x_1,\ldots,x_p$ are used for predicting y.


The reason for defining it like this is explained above: it is the "best" predictor of y, "best" in the minimum mean square sense. Students usually ask the following question:


Question. "Regress" means to go back. Are we going back to something, or why is the word regression used for the best predictor?


Answer. By this process, we are not going back to anything. It is simply the best predictor, best in the minimum mean square sense. But originally the problem was to study characteristics of offspring and to say something about the parents or the previous generation. Thus the original problem was a problem of going back. Nowadays, we are not using it for going back but using it to come up with the best predictor, whatever be the situation.


Question. Do we need to know the joint distribution of $x_1,\ldots,x_k$ in order to construct the best predictor $E(x_1|x_2,\ldots,x_k)$?


Answer. If the joint distribution is known and if the conditional expectation exists, then we can compute $E(x_1|x_2,\ldots,x_k)$. If the conditional distribution of $x_1$, given $x_2,\ldots,x_k$, is known, then also we can construct $E(x_1|x_2,\ldots,x_k)$, provided it exists. Hence, we should know at least the conditional distribution, if not the joint distribution.


Example 16.2. From the joint density,

$$f(x,y) = \frac{1}{2\sqrt{2\pi}}e^{-\frac{1}{2}(y-1-2x-3x^2)^2},\quad -\infty < y < \infty,\ 0 \le x \le 2$$

and zero elsewhere, compute the best predictor of y at given values of x.




Solution 16.2. In order to get the marginal density of x, integrate out y. But

$$\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(y-\mu)^2}\,dy = 1,\quad \mu = 1 + 2x + 3x^2.$$

Hence the marginal density of x is $f_1(x) = \frac{1}{2}$, $0 \le x \le 2$ and zero elsewhere. Dividing $f(x,y)$ by this $f_1(x)$ we get the conditional density of y, given x, as

$$g(y|x) = \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(y-\mu)^2},\quad \mu = 1 + 2x + 3x^2.$$

Hence from this conditional normal density, $E(y|x) = 1 + 2x + 3x^2$, which is the best predictor of y at given values of x in this case. Note that here it is a non-linear function of x.


Question. It is said that linear regression uniquely determines a normal or Gaussian density. In the above example, we get a non-linear function of x, namely $1 + 2x + 3x^2$, as the regression of y given x. Is there a contradiction here?


Answer. It is known that if y and x are jointly normally distributed then the regression of y on x, and x on y, are linear and the linear functions are the following:

$$E(y|x) = \mu_2 + \rho\frac{\sigma_2}{\sigma_1}(x - \mu_1)$$ (16.1)

where $E(x) = \mu_1$, $E(y) = \mu_2$, $\mathrm{Var}(x) = \sigma_1^2$, $\mathrm{Var}(y) = \sigma_2^2$ and the correlation between x and y is $\rho$, and

$$E(x|y) = \mu_1 + \rho\frac{\sigma_1}{\sigma_2}(y - \mu_2).$$ (16.2)

Also, linear regression, under some additional conditions, characterizes joint normality for x and y; see the books on characterizations or see, for example, [11]. In Example 16.2, x and y are not jointly normally distributed. There, only the conditional density of y, given x, is normal, with the conditional expectation of y, given x, non-linear. We can have the conditional density of y normal with all sorts of conditional expected values, linear or non-linear. For computing $E(y|x)$, we need only the conditional density of y, given x, and the joint density is not necessary. Hence there is no contradiction here.


Question. There is a theorem which says that E(y) = E[E(y|x)]. Is it true for all types 
of variables x and y as long as there is a conditional distribution of y, given x, and a 
marginal distribution of x, in the light of the answer to the previous question saying 
that all sorts of E(y|x) can be there? 


Answer. The theorem E(y) = E[E(y|x)] does not hold everywhere. In some examples, 
the theorem will hold and in others it need not hold. Consider the following example. 




Example 16.3. Evaluate E(y), if possible, from the following joint density:

$$f(x,y) = \frac{1}{\sqrt{2\pi}}\,\frac{1}{x^2}\,e^{-\frac{1}{2}(y-1-x)^2},\quad -\infty < y < \infty,\ 1 \le x < \infty$$

and zero elsewhere.


Solution 16.3. Integrating out y from the joint density we get the marginal density of x here as

$$f_1(x) = \begin{cases}\dfrac{1}{x^2}, & 1 \le x < \infty\\ 0, & \text{elsewhere.}\end{cases}$$

Here, as well as in the previous examples in this section, we cannot integrate out x to get an analytic expression for the density of y. But we can try to evaluate E(y) by using the theorem

$$E(y) = E[E(y|x)].$$ (16.3)

Here, $E(y|x) = 1 + x$, and hence $E[E(y|x)] = E(1 + x) = 1 + E(x)$. Note that

$$E(x) = \int_1^{\infty}x\,\frac{1}{x^2}\,dx = \int_1^{\infty}\frac{1}{x}\,dx = [\ln x]_1^{\infty} = \infty$$

or does not exist. Hence E(y) cannot be computed by using the above theorem. That theorem is valid only when all the expected values exist.
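
The divergence of E(x) in Example 16.3 can be seen by truncating the integral at a finite upper limit. The sketch below assumes a midpoint rule and illustrative upper limits; the truncated integral of $x \cdot x^{-2}$ over [1, B] equals $\ln B$, which grows without bound as B increases.

```python
import math

def truncated_mean_x(upper, n=100000):
    """Midpoint-rule value of the integral of x * (1/x**2) over [1, upper]."""
    h = (upper - 1.0) / n
    return sum(h / (1.0 + (i + 0.5) * h) for i in range(n))

for upper in (10.0, 100.0, 1000.0):
    print(upper, truncated_mean_x(upper), math.log(upper))
```

Each tenfold increase in the upper limit adds about $\ln 10$ to the truncated integral, so no finite limit is approached.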

If the joint distribution or the conditional distribution is unknown, then the regression of $x_1$ on $x_2,\ldots,x_k$ cannot be evaluated. In this case, if we have some idea about the conditional expectation, such as that it is a linear function of the conditioned variables $x_2,\ldots,x_k$ or has some other specific form, then we go for estimation of the regression function, where we use the method of least squares. Thus least square analysis comes in for estimating the regression function when the conditional distribution is not known. If the conditional expectation is suspected to be linear, that is,

$$E(x_1|x_2,\ldots,x_k) = \beta_1 + \beta_2x_2 + \cdots + \beta_kx_k$$ (i)

and if the conditional distribution of $x_1$, given $x_2,\ldots,x_k$, is unknown, then we set up a model of the type

$$x_{1j} = a_1 + a_2x_{2j} + \cdots + a_kx_{kj} + e_j,\quad j=1,\ldots,n$$ (ii)

and try to estimate $a_1,a_2,\ldots,a_k$ so that the regression function can be estimated by using the estimated values $\hat{a}_1,\hat{a}_2,\ldots,\hat{a}_k$ and substituting $\hat{a}_i$ for $\beta_i$ in (i). The observation matrix on $x_1,x_2,\ldots,x_k$ is of the following form:

$$X = \begin{bmatrix} x_{11} & x_{21} & \cdots & x_{k1}\\ x_{12} & x_{22} & \cdots & x_{k2}\\ \vdots & \vdots & \ddots & \vdots\\ x_{1n} & x_{2n} & \cdots & x_{kn}\end{bmatrix}.$$




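Here, $X = (x_{ij})$. Let the column averages be denoted by $\bar{x}_i = \frac{1}{n}\sum_{j=1}^{n}x_{ij}$, $i=1,\ldots,k$. If $a_1$ is eliminated from the model by subtracting the averages, then we end up with the matrix corresponding to X as

$$\tilde{X} = X - \bar{X},\quad \bar{X} = \begin{bmatrix}\bar{x}_1 & \bar{x}_2 & \cdots & \bar{x}_k\\ \vdots & \vdots & & \vdots\\ \bar{x}_1 & \bar{x}_2 & \cdots & \bar{x}_k\end{bmatrix},\quad \tilde{X}'\tilde{X} = S$$

where $S = (s_{ij})$, $s_{ij} = \sum_{r=1}^{n}(x_{ir} - \bar{x}_i)(x_{jr} - \bar{x}_j)$. Then we may partition S as

$$S = \begin{bmatrix}s_{11} & S_{12}\\ S_{21} & S_{22}\end{bmatrix},\quad s_{11}\ \text{is}\ 1\times 1\ \text{and}\ S_{22}\ \text{is}\ (k-1)\times(k-1).$$

Question. Why are the $s_{ij}$'s called "corrected" sums of products in some books? Was there a mistake somewhere in the procedure?


Answer. The word "corrected" is used to indicate that deviations from the respective averages $\bar{x}_i$, $i = 1,\ldots,k$ are taken. That is all. There is no mistake anywhere.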


Question. Can we take $S_{22}$ to be non-singular?


Answer. In a regression type model building situation, the $x_{ij}$'s are preassigned quantities for $i = 2,\ldots,k$ and $j = 1,\ldots,n$. When we preassign $(x_{2j},\ldots,x_{kj})$ for a specific j, we are not going to preassign again a multiple of this vector or a linear function of the points already selected, because no additional information will be forthcoming. Hence we may assume, without loss of generality, $S_{22}$ to be non-singular. If the $x_{ij}$'s are coming from a design type model, then the $x_{ij}$'s are determined by the design of the experiment, where usually X will be a less than full rank matrix, making S singular.

Then the sample multiple correlation, denoted by $R_{1.(2\ldots k)}$, is defined by

$$R_{1.(2\ldots k)}^2 = \frac{S_{12}S_{22}^{-1}S_{21}}{s_{11}}$$ (a)

parallel to the corresponding population quantities.


Question. Why do we take the form in (a)? Is it because the regression may be linear 
all the time? 


Answer. If the regression is known to be linear, then we have proved in Chapter 14 that the population multiple correlation coefficient $\rho_{1.(2\ldots k)} = \sqrt{\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}/\sigma_{11}}$ is the maximum correlation in the class of linear predictors for $x_1$, based on $x_2,\ldots,x_k$. If the regression is known to be non-linear, then we have a corresponding measure, the multiple correlation ratio, and then that can be used. If nothing is known about whether the regression is linear or non-linear, then $\rho_{1.(2\ldots k)}$ is a convenient quantity to calculate, and hence $\rho_{1.(2\ldots k)}$ and the corresponding sample value $R_{1.(2\ldots k)}$, given in (a), are used in practical situations.


Question. Is the sample multiple correlation coefficient a good measure to use to check the "goodness" of a model, as is usually done in practice?


Answer. No. It is not an indicator of the "goodness" of the model in most practical situations. In Section 14.6.2, it is seen that the multiple correlation coefficient keeps on increasing if more and more variables are included, even though the variables themselves may not have any relevance in estimating $x_1$. If the variables $x_1,x_2,\ldots,x_k$ are jointly normally distributed, then one can check whether the sample multiple correlation $R_{1.(2\ldots k)}$ is "significantly large" or not. In a practical situation, joint normality may not be there, and it is also difficult to check for joint normality. This author suggests using $s^2$, the least square minimum, to check for the "goodness" of the model. $s^2$ is a squared distance between the observed $x_{1j}$ and the estimated $\hat{x}_{1j}$, estimated by the model. Hence the criterion "the smaller the distance, the better the model" can be used. At each stage compute $s^2$, and if $s^2$ increases or remains steady, then remove the new variable introduced. If $s^2$ decreases, then proceed, and stop the process when the decrease in $s^2$ is insignificantly small.


Question. Usually, the hypothesis Ηρ : c; = 0 is tested by using a Student-t test to 
delete or retain c; in the model. Is it a proper procedure? 


Answer. A Student-t statistic arises from the normality assumption, that is, the
assumption that $x_1$, at given $x_2,\dots,x_k$, is conditionally normal, or that the conditional density
is a normal density. There are many characterization theorems available to characterize
or uniquely determine a normal distribution. In the light of these, one can analyze
the practical situation at hand and decide whether a normality assumption is reasonable.
If reasonable, then a Student-t test can be used. But $s^2$, the least square minimum,
avoids all such distributional problems.

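If we are estimating $x_1$ by using $x_2$, assuming a simple linear model, then the model
will be of the form

\[
x_{1j} = c_0 + c_1 x_{2j} + e_j, \quad j = 1,\dots,n
\]

and the estimated linear function is of the following form:

\[
x_1 - \bar{x}_1 = \frac{s_{12}}{s_{22}}\,(x_2 - \bar{x}_2)
\;\Rightarrow\;
x_1 - \bar{x}_1 = r\,\sqrt{\frac{s_{11}}{s_{22}}}\,(x_2 - \bar{x}_2)
\]

where $r = s_{12}/\sqrt{s_{11}s_{22}}$ represents the sample correlation coefficient.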
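
As a quick numerical illustration of the estimated line, the sketch below checks that the slope computed from the sums of products agrees with the slope written in terms of the sample correlation coefficient and with an ordinary least squares fit. The data are simulated and the variable names are assumptions for illustration only.

```python
import numpy as np

# Simulated (hypothetical) data for x2 and x1.
rng = np.random.default_rng(1)
x2 = rng.normal(size=50)
x1 = 3.0 + 0.8 * x2 + rng.normal(scale=0.3, size=50)

# Corrected sums of squares and products.
s11 = np.sum((x1 - x1.mean()) ** 2)
s22 = np.sum((x2 - x2.mean()) ** 2)
s12 = np.sum((x1 - x1.mean()) * (x2 - x2.mean()))
r = s12 / np.sqrt(s11 * s22)              # sample correlation coefficient

slope_from_products = s12 / s22
slope_from_r = r * np.sqrt(s11 / s22)
slope_from_fit = np.polyfit(x2, x1, 1)[0]  # degree-1 fit: [slope, intercept]

print(slope_from_products, slope_from_r, slope_from_fit)
```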


Question. Suppose that we wish to test a hypothesis such as $c_1 = 0$ or construct a
confidence interval for $c_1$. Is it equivalent to testing the hypothesis $\rho = 0$, where $\rho$ is the
corresponding population correlation coefficient, or constructing a confidence interval for $\rho$
and then converting it for $c_1$?


Answer. Unfortunately, the procedures are not equivalent. Decision making on the
coefficient $c_1$ can be done in the conditional space at given values of $x_2$, whereas
inference on $\rho$ will require the joint distribution of $x_1$ and $x_2$; the conditional distribution is
not sufficient. Hence the two procedures have two different premises and they are
different. One can be done in the conditional space but the other needs the entire space
or joint space.


Question. In model building situations we usually assume the errors $e_j$'s to be
normally distributed. Why assume normality? Are not the errors normally distributed
according to Gauss?


Answer. Suppose that the errors satisfy some basic conditions such as: (i) $e_j$ is contributed by
infinitely many unknown factors, all contributing independently, and such contributions
are infinitesimally small; (ii) the contribution of each such factor can be positive
or negative with probability $\frac{1}{2}$ each; (iii) the total variance, $\mathrm{Var}(e_j)$, is a finite
quantity $\sigma^2$. Under these conditions, it can be mathematically derived that the error will
be normally distributed with expected value zero and variance $\sigma^2$, that is, in such a
situation $e_j \sim N(0,\sigma^2)$, $j=1,\dots,n$, and independently distributed. Hence normality is a
reasonable assumption. But there are situations where the errors may not follow normal
distributions.
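
Conditions (i) to (iii) can be mimicked by simulation. The sketch below is a finite approximation, not a proof: each error is built from `n_factors` independent contributions of size $\sigma/\sqrt{\text{n\_factors}}$, each positive or negative with probability 1/2. The function name and the parameter values are hypothetical choices for illustration.

```python
import numpy as np

def simulate_errors(n_errors, n_factors, sigma, rng):
    """Each error is a sum of n_factors contributions of size sigma/sqrt(n_factors),
    each with sign + or - with probability 1/2."""
    plus = rng.binomial(n_factors, 0.5, size=n_errors)  # number of positive contributions
    step = sigma / np.sqrt(n_factors)
    return step * (2 * plus - n_factors)                # (#plus - #minus) * step

rng = np.random.default_rng(3)
sigma = 2.0
e = simulate_errors(n_errors=100_000, n_factors=10_000, sigma=sigma, rng=rng)

print("mean:", e.mean())
print("variance:", e.var(), "target:", sigma**2)
# For a N(0, sigma^2) variable about 95% of the values fall within 1.96 sigma.
print("fraction within 1.96 sigma:", np.mean(np.abs(e) <= 1.96 * sigma))
```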


Question. If we know beforehand that $y$ has a symmetric distribution so that $E(y) = 0$,
then should we take the model as $y_j = c_1x_{1j} + \dots + c_kx_{kj} + e_j$, $j = 1,\dots,n$, or include $c_0$
also?


Answer. If all observations on $y$ are zeros when we preassign $x_1 = 0,\dots,x_k = 0$, then
we can take the model with $c_0 = 0$. If that does not happen, and usually in a practical
situation this does not happen, then take the model with $c_0$ in and eliminate $c_0$ by the
procedure described below and continue with the analysis. The error sum of squares
is $\sum_{j=1}^{n} e_j^2$. Then the equation $\frac{\partial}{\partial c_0}\big[\sum_{j=1}^{n} e_j^2\big] = 0$ gives

\[
c_0 = \bar{y} - c_1\bar{x}_1 - \dots - c_k\bar{x}_k, \quad \bar{x}_i = \frac{1}{n}\sum_{j=1}^{n} x_{ij}, \quad i = 1,\dots,k.
\]

If we substitute for $c_0$, then we have the model

\[
y_j - \bar{y} = c_1(x_{1j} - \bar{x}_1) + \dots + c_k(x_{kj} - \bar{x}_k) + e_j, \quad j = 1,\dots,n.
\]

Then the normal equations become the following:

\[
(X - \bar{X})'(X - \bar{X})\beta = (X - \bar{X})'(Y - \bar{Y}) \tag{16.4}
\]




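where

\[
Y - \bar{Y} = \begin{bmatrix} y_1 - \bar{y} \\ \vdots \\ y_n - \bar{y} \end{bmatrix}, \quad
X - \bar{X} = \begin{bmatrix} x_{11} - \bar{x}_1 & \cdots & x_{k1} - \bar{x}_k \\ \vdots & & \vdots \\ x_{1n} - \bar{x}_1 & \cdots & x_{kn} - \bar{x}_k \end{bmatrix}, \quad
\beta = \begin{bmatrix} c_1 \\ \vdots \\ c_k \end{bmatrix}, \quad X - \bar{X} \text{ is } n \times k.
\]

Then, under non-singularity of $(X - \bar{X})'(X - \bar{X})$, we have the solution:

\[
\hat{\beta} = [(X - \bar{X})'(X - \bar{X})]^{-1}(X - \bar{X})'(Y - \bar{Y}) \tag{16.5}
\]

and the least square minimum

\[
s^2 = (Y - \bar{Y})'\big[I - (X - \bar{X})[(X - \bar{X})'(X - \bar{X})]^{-1}(X - \bar{X})'\big](Y - \bar{Y})
\]

with $E(s^2) = (n - k)\sigma^2$ and, under normality, $\frac{s^2}{\sigma^2} \sim \chi^2_{n-k}$, so that
$\hat{\sigma}^2 = \frac{s^2}{n-k}$ is an unbiased estimator for $\sigma^2$. Use this $\hat{\sigma}^2$ for constructing Student-t
for testing hypotheses on individual parameters $c_1,\dots,c_k$. In this case, the degrees of
freedom is $n - k$, not $n - (k + 1)$.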
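
The elimination of $c_0$ by centering can be checked numerically. The sketch below, on simulated data with hypothetical names, solves the normal equations (16.4) for the centered model, recovers $c_0$ from the sample means, and compares the result with a direct least squares fit of the model that contains $c_0$.

```python
import numpy as np

# Simulated (hypothetical) data: n observations on k regressors.
rng = np.random.default_rng(2)
n, k = 40, 3
X = rng.normal(size=(n, k))
y = 5.0 + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.4, size=n)

Xc = X - X.mean(axis=0)          # X - X̄, an n x k matrix
yc = y - y.mean()                # Y - Ȳ

# Solution (16.5) of the normal equations (16.4).
beta_hat = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)
c0_hat = y.mean() - X.mean(axis=0) @ beta_hat

# Least square minimum: (Y-Ȳ)'(Y-Ȳ) - (Y-Ȳ)'(X-X̄) beta_hat.
s2 = yc @ yc - yc @ Xc @ beta_hat

# Direct fit of the model with c0 included, for comparison.
A = np.column_stack([np.ones(n), X])
full, *_ = np.linalg.lstsq(A, y, rcond=None)
rss_full = float(np.sum((y - A @ full) ** 2))

print("centered:", c0_hat, beta_hat, s2)
print("direct:  ", full[0], full[1:], rss_full)
```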


16.3 Questions and answers on tests of hypotheses 


Question. Why is $H_0$ called the “null” hypothesis? Is there anything made empty or
zero by this hypothesis?


Answer. Nothing is made zero or empty. It is simply the hypothesis being tested. The
word “null” came about for historical reasons. Testing was originally developed for
agricultural experiments, where the experimenter liked to make a claim that the expected
yield under a particular fertilizer was different from that of another fertilizer, that is,
a hypothesis of the type $\mu_1 \neq \mu_2$ where $\mu_1$ and $\mu_2$ are the expected values. But open
statements cannot be tested, which will be seen later, and hence the hypothesis is
made as $H_0: \mu_1 = \mu_2$ and tested against $H_1: \mu_1 \neq \mu_2$. Here, the negation is tested to say
something about the hypothesis of interest. Hence “no difference” brought in the term
“null”. Nowadays there is no such meaning. $H_0$ simply means the hypothesis that is
being tested.


Question. In many books, for testing a hypothesis such as $\mu_1 > \mu_2$, it is suggested to take
$H_0: \mu_1 = \mu_2$ and test it against $\mu_1 > \mu_2$. Can we take the alternate as we please or according
to convenience?


Answer. No. The procedure is incorrect unless it is guaranteed that either $\mu_1 = \mu_2$
or $\mu_1 > \mu_2$ and nothing else is possible, which is not a practical situation. When
$H_0: \mu_1 = \mu_2$, then the natural alternate is $H_1: \mu_1 \neq \mu_2$. But if someone wishes to
make a claim $\mu_1 > \mu_2$, then take $H_0: \mu_1 \le \mu_2$ and test it against the natural
alternate $H_1: \mu_1 > \mu_2$. This is a valid procedure. The null and alternate hypotheses must
cover the entire parameter space. Hence one cannot select the alternate according to
convenience. Sometimes, a wrong procedure can lead to the same test criterion
obtained through the correct procedure. This will be explained when we talk about test
criteria.


Question. Is there a practical situation where both $H_0$ and $H_1$ are simple?


Answer. Usually not, but there can be situations where there are only two points in 
the parameter space. Consider a machine automatically filling 10 kg sugar bags. The 
expected weight of each bag is 10 but there can be slight variations from bag to bag 
because the machine is not counting sugar crystals, probably the machine is timing 
it or automatic weighing process is there. Suppose that a machine is automatically 
filling 20 kg sacks with coconuts, where the machine is not allowed to cut or chop any 
coconut. Then the variation from bag to bag will be substantial up to the weight of one 
coconut. In these examples, in one case $E(x) = 10$ = expected weight and in the
other case $E(x) = 20$. Suppose that the sugar filling machine went out of control for a
few minutes so that the setting went to 9.5 instead of 10 before it was recognized
and corrected, or a dishonest wholesaler set the machine purposely at 9.5 for some
time. All bags are sent to the market. There are only two possibilities here: either the
expected value is 10 or the expected value is 9.5 and nothing else, that is, the parameter
space has only two points, 10 and 9.5. Hence a simple $H_0$ versus a simple $H_1$ is also
possible, and makes sense, here.


Question. In many books, it is written “reject $H_0$, accept $H_0$”. Is non-rejection
equivalent to accepting the hypothesis?


Answer. You will see later that the whole testing procedure is geared to rejecting $H_0$.
If $H_0$ is not rejected, then nothing can be stated logically. Testing is carried out by
using one data set. If $H_0$ is not rejected in one data set, there is no guarantee that $H_0$ is
not rejected in all data sets. We have several real-life examples. The drug thalidomide
was tested on mice, rabbits and several types of animals, and the hypothesis that the
drug was good or effective and safe was not rejected. The drug was introduced into the
market and resulted in a large number of deformed babies, and the drug was banned
eventually. If $H_0$ is rejected in one data set, we are safe and justified in rejecting $H_0$
because it is rejected in at least one data set. Also, these are not mathematical
statements. When we reject $H_0$ we are not saying that $H_0$ is not really true. It says only that
as per the testing procedure used and as per the data in hand, $H_0$ is to be rejected. The
logical way of formulating the decision is “reject $H_0$” and “not reject $H_0$”. The question
of acceptance does not arise anywhere in the testing procedure.




Question. In many books, it is stated “independent observations are taken”.
Observations are numbers, and how can we associate statistical independence with numbers?


Answer. A simple random sample, often called a sample, is a set of independently
and identically distributed (iid) random variables, not numbers. Suppose
that $\{x_1,\dots,x_n\}$ is the sample of size $n$, where $x_1,\dots,x_n$ are iid variables and not
numbers. If we take one observation on $x_1$, one observation on $x_2$, ..., one observation on
$x_n$, then we say we have $n$ independent observations. “Independent observations” is
used in this sense.


Question. Is the likelihood ratio test a uniformly most powerful test (UMPT) in the light
of the Neyman-Pearson lemma?


Answer. The Neyman-Pearson lemma is applicable in the case of simple $H_0$ versus
simple $H_1$, which can be stretched to simple $H_0$ versus composite $H_1$, as illustrated in
the worked example. But for composite $H_0$ versus composite $H_1$ there is no guarantee
that we get the UMPT. In most of the problems that we consider in this book, we have
UMPT.


Question. Is maximizing L, the likelihood function, the same as maximizing ln L and 
vice versa? 


Answer. Yes. As long as $\phi(L)$ is a one-to-one function of $L$, then $\frac{\partial}{\partial\theta}L = 0 \Rightarrow \frac{\partial}{\partial\theta}\phi(L) = 0$.
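
A small grid search illustrates the point for a Bernoulli sample. The observations and the grid are hypothetical choices for illustration; the sketch only shows that the likelihood and its logarithm peak at the same grid value.

```python
import math

xs = [1, 0, 1, 1, 0, 1, 0, 1]              # hypothetical Bernoulli observations
grid = [i / 1000 for i in range(1, 1000)]  # candidate values of p in (0, 1)

def likelihood(p):
    return math.prod(p**x * (1 - p) ** (1 - x) for x in xs)

def log_likelihood(p):
    return sum(x * math.log(p) + (1 - x) * math.log(1 - p) for x in xs)

p_from_L = max(grid, key=likelihood)
p_from_lnL = max(grid, key=log_likelihood)
print(p_from_L, p_from_lnL, sum(xs) / len(xs))
```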


Question. Instead of differentiating with respect to $\sigma^2$, as was done in the Gaussian
case, if we had differentiated with respect to $\sigma$, do we get the same estimate for $\sigma^2$ as $s^2$?


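Answer. Yes. We would have got the same answer. The reason is that, by the chain rule,

\[
\frac{\partial}{\partial\sigma}[\cdot] = 2\sigma\,\frac{\partial}{\partial\sigma^2}[\cdot],
\quad\text{so that for } \sigma \neq 0,\quad
\frac{\partial}{\partial\sigma}[\cdot] = 0 \;\Leftrightarrow\; \frac{\partial}{\partial\sigma^2}[\cdot] = 0,
\]

or both will lead to the same solution. In fact, for any non-trivial function $\psi(\sigma)$ for
which $\frac{d\psi}{d\sigma} \neq 0$, the maximum likelihood estimate (MLE) of $\psi(\sigma)$ is $\psi(\hat{\sigma})$, where $\hat{\sigma}$ is the
MLE of $\sigma$.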


Question. In the maximum likelihood procedure, should we not have taken the 
second-order derivatives and checked for the definiteness of the matrix of second- 
order derivatives evaluated at the critical points to check for maxima/minima? Why 
stop at finding the critical points and claiming that the maximum occurs at the critical 
point? 


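Answer. Yes. We should have taken the second-order derivatives and checked for the
definiteness of the matrix of second-order derivatives. Here (Gaussian case), it is easily
seen that the matrix of second-order derivatives is negative definite. With $\theta = \sigma^2$:

\[
\frac{\partial^2}{\partial\mu^2}\ln L = -\frac{n}{\theta} = -\frac{n}{s^2} \ \text{ at the critical point,}
\]
\[
\frac{\partial^2}{\partial\theta\,\partial\mu}\ln L = -\frac{1}{\theta^2}\sum_{j=1}^{n}(x_j - \mu) = 0 \ \text{ at the critical point,}
\]
\[
\frac{\partial^2}{\partial\theta^2}\ln L = \frac{n}{2\theta^2} - \frac{1}{\theta^3}\sum_{j=1}^{n}(x_j - \mu)^2 = -\frac{n}{2(s^2)^2} \ \text{ at the critical point.}
\]

Therefore, the matrix of second-order derivatives, evaluated at the critical point $(\hat{\mu},\hat{\theta}) = (\bar{x}, s^2)$, is given by

\[
\begin{bmatrix} -\dfrac{n}{s^2} & 0 \\ 0 & -\dfrac{n}{2(s^2)^2} \end{bmatrix}
\]

which is negative definite. Hence the critical point corresponds to a maximum. We can
also note this by observing the behavior of $\ln L$ for all possible values of $\mu$ and $\sigma^2$. You
will see that $\ln L$ goes from $-\infty$ back to $-\infty$ through finite values, and hence the only
critical point must correspond to a maximum.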
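
The negative definiteness can also be checked numerically. The sketch below assumes a Gaussian sample, with $\theta = \sigma^2$ and the MLEs $\hat{\mu} = \bar{x}$ and $\hat{\theta} = \frac{1}{n}\sum_j (x_j - \bar{x})^2$; the function name and the simulated data are hypothetical.

```python
import numpy as np

def hessian_at_mle(x):
    """Matrix of second-order derivatives of ln L in (mu, theta), theta = sigma^2,
    evaluated at the critical point of a Gaussian log-likelihood."""
    n = len(x)
    mu_hat = x.mean()
    theta_hat = np.mean((x - mu_hat) ** 2)      # MLE of sigma^2
    d_mu_mu = -n / theta_hat
    d_mu_theta = -np.sum(x - mu_hat) / theta_hat**2
    d_theta_theta = (n / (2 * theta_hat**2)
                     - np.sum((x - mu_hat) ** 2) / theta_hat**3)
    return np.array([[d_mu_mu, d_mu_theta],
                     [d_mu_theta, d_theta_theta]])

rng = np.random.default_rng(4)
x = rng.normal(loc=1.0, scale=2.0, size=30)     # simulated sample
H = hessian_at_mle(x)
print(H)
print("eigenvalues:", np.linalg.eigvalsh(H))    # both negative for a maximum
```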


Question. If the null hypothesis is $H_0: \mu < \mu_0$, is there an MLE under the null
hypothesis?


Answer. No. There does not exist an MLE if the hypothesis is in an open interval.
Nothing can be maximized in an open interval. $H_0$ must have a boundary point, otherwise
the MLE does not exist. This is a very important point. For a general $\theta$, if the claim is
$\theta < \theta_0$, then formulate the null hypothesis as $\theta \ge \theta_0$ and test it against $\theta < \theta_0$. Such a
test is possible under the likelihood ratio test because the MLE under $\Omega$ (parameter
space) and under $H_0$ are both available.


Question. If we had taken $H_0: \mu = \mu_0$ against $H_1: \mu > \mu_0$, would it not give the same
$\lambda$-criterion?


Answer. Yes. But the procedure is logically incorrect because the alternate must be
the natural alternate. If $H_0$ is $\mu = \mu_0$, then the natural alternate is $H_1: \mu \neq \mu_0$. If the
possible parameter values are known to be $\mu_0 \le \mu < \infty$, then the alternate for $H_0: \mu = \mu_0$
is $H_1: \mu > \mu_0$. Otherwise, the procedure is logically incorrect.


Hint. Note that the rejection region is in the direction of the alternate. Here, the
alternate is $H_1: \mu > \mu_0$ and we reject for large values of $z$, or for $z > z_\alpha$. This, in fact, is a
general observation.


Question. Is it true that if $x_1,\dots,x_p$ are jointly normally distributed then any linear
function of $x_1,\dots,x_p$ is univariate normal?




Answer. If the multivariate normal is taken as the usual multivariate function
denoted by $N_p(\mu, \Sigma)$, $\Sigma > 0$, or $\Sigma$ is at least positive semi-definite, then it can be proved
that any linear function $u$ of $x_1,\dots,x_p$, containing at least one variable, is univariate
normal with the parameters $E(u)$ and $\mathrm{Var}(u)$.


Question. Does non-rejection of $H_0$ mean that the hypothesis $H_0$ is accepted?


Answer. Non-rejection does not mean acceptance. Non-rejection may be due to many
reasons. Our assumption of joint normality may be faulty, and in another data set the
decision may be to reject. The procedure of constructing the test is geared to
rejecting $H_0$. We fixed the probability of rejection when $H_0$ is true as $\alpha$ (we give that much
probability for our decision to be wrong) and we maximized the probability of
rejection when $H_0$ is not true. The most that we can say is that the data seem to be
consistent with the hypothesis.


Question. If we have the data in hand, then by looking at the data we can create a 
hypothesis that can never be rejected or that can always be rejected. Then what is the 
meaning of testing of hypotheses? 


Answer. If we have the data in hand, then by looking at the data we can create a
hypothesis of our interest to reject or not to reject. This is not the idea of testing a
statistical hypothesis. We create a hypothesis first. Then we collect data in the form of
a random sample on the relevant variables. Then we carry out the test by using one of the
testing procedures. This is the idea.


Question. Consider, for example, the hypothesis $H_0: \mu \le \mu_0$ in a $N(\mu,\sigma^2)$ with $\sigma^2$
known. Then we reject for large values of $z = \frac{\sqrt{n}(\bar{x} - \mu_0)}{\sigma}$. Since an $n$ is present in the
numerator, can we not take $n$ large enough so that we will not reject $H_0$ at all? Is it not
true?


Answer. This is a question the critics of testing of statistical hypotheses raise. It is
not quite correct. By the weak law of large numbers, $\bar{x}$ is going to the true value of $\mu$. If
the null hypothesis is correct, then $\bar{x}$ is going to $\mu_0$ when $n \to \infty$. Hence when $n \to \infty$,
$z$ is not going to $\infty$ freely. When $n$ is changing, the density becomes more and more
peaked and still the probability for large $z$ is kept as $\alpha$. Hence, even though it appears
that $z$ is going to infinity when $n \to \infty$, it is not so.
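
A simulation can illustrate that the rejection probability stays near $\alpha$ whatever the sample size, when the true mean sits at the boundary $\mu_0$. The sketch below is an illustration under stated assumptions: the function name, the number of replications and the shortcut of drawing $\bar{x}$ directly from $N(\mu_0, \sigma^2/n)$ instead of averaging $n$ observations are choices made here, not part of the text.

```python
import numpy as np
from statistics import NormalDist

def rejection_rate(n, mu0=0.0, sigma=1.0, alpha=0.05, reps=20_000, seed=0):
    """Fraction of simulated samples of size n for which z >= z_alpha,
    when the true mean equals mu0."""
    rng = np.random.default_rng(seed)
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    # Shortcut: the sample mean of n iid N(mu0, sigma^2) values is N(mu0, sigma^2/n).
    xbar = rng.normal(mu0, sigma / np.sqrt(n), size=reps)
    z = np.sqrt(n) * (xbar - mu0) / sigma
    return float(np.mean(z >= z_alpha))

for n in (10, 1_000, 100_000):
    print(n, rejection_rate(n))
```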


Question. In the likelihood function relating to a Bernoulli population, why are we not
taking the binomial coefficient $\binom{n}{x}$?


Answer. We are taking a simple random sample from a Bernoulli population. Then
the likelihood function is $\prod_{j=1}^{n} p^{x_j}(1 - p)^{1-x_j}$ and there is no binomial coefficient $\binom{n}{x}$.
If we had taken one observation from a binomial distribution, then there would have
been this binomial coefficient because we have derived the probability function of $x$,
which has the binomial coefficient coming from the number of combinations of $n$
taking $x$ at a time, where $x$, the binomial random variable, is really the Bernoulli
sum.


Question. In all the problems before, we had taken the probability statement as
exactly equal to $\alpha$. Why do we take it as $\le \alpha$ in the discrete case, why not $\ge \alpha$?


Answer. In all the previous problems, our population was continuous, and hence we
were able to solve for the critical point with the right side exactly equal to $\alpha$. When the
population is discrete, individual probability masses are at distinct points. When we
start adding up from one end, the sum need not hit exactly $\alpha = 0.05$, $\alpha = 0.01$, etc. Up
to a certain point, the sum may be less than $\alpha$, and when you add the mass at the next
point the total may exceed $\alpha$. Then we stop at the point where the sum is less than $\alpha$ if
it did not hit $\alpha$. Why not take bigger than $\alpha$, or the point when it just exceeds $\alpha$? We are
prefixing $\alpha$, that is, allowing a certain tolerance for going wrong, and the smaller that
tolerance, the better. Hence we take less than or equal to $\alpha$.
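
The adding-up of probability masses from one end can be sketched as follows for a binomial population and an upper-tail rejection region. The function name and the example values ($n = 20$, $p_0 = 0.5$, $\alpha = 0.05$) are hypothetical illustrations.

```python
from math import comb

def upper_critical_point(n, p0, alpha):
    """Smallest c such that P(X >= c) <= alpha for X binomial(n, p0).
    Returns (c, attained tail probability); c = n + 1 means an empty rejection region."""
    tail = 0.0
    c = n + 1
    for x in range(n, -1, -1):                 # add masses starting from the upper end
        mass = comb(n, x) * p0**x * (1 - p0) ** (n - x)
        if tail + mass > alpha:                # the next mass would exceed alpha: stop
            break
        tail += mass
        c = x
    return c, tail

c, attained = upper_critical_point(n=20, p0=0.5, alpha=0.05)
print("reject when x >=", c, "with attained size", attained)
```

The attained size is typically strictly less than the prefixed $\alpha$, which is the point made in the answer above.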


Question. In using lack-of-fit or goodness-of-fit tests, if we had created a claim by 
looking at the data then we could have not rejected the hypothesis. Why not modify 
the claim? 


Answer. In any testing procedure, the hypothesis or the claim has to come first, then
we take the data and check the claim against the data, and not the other way around.
If the claim is made by looking at the data, then we can always make the claim either
consistent with the data (not to reject $H_0$) or reject $H_0$. Then the purpose of testing
hypotheses will be defeated.


Question. Does it mean that no other Poisson model is a better fit to the data [testing 
goodness-of-fit of a Poisson model]? 


Answer. No. We only fitted one Poisson model with $\lambda = 2.4$. We did not exhaust all
possible parameter values. We got 2.4 as the MLE, which as an estimator has many
interesting properties. That is all.


Question. Even though MLE has many interesting properties, can we find a better 
fitting Poisson model to this data? 


Answer. Usually, we will be able to find a better fitting model from the same family.
In the present case, the answer is “yes”. Take $\lambda = 2.5$. Then the $X^2$ value can be seen to
be 2.36 < 4.04 < 9.49, which shows that the Poisson models with $\lambda = 2.4$ and $\lambda = 2.5$ are
good fits to the data. Further, since $X^2$ is a measure of generalized distance between
the observed and expected frequencies, the smaller the $X^2$ value, the better the model. Hence
the Poisson model with $\lambda = 2.5$ is a better Poisson model for the data than the one with
$\lambda = 2.4$ given by the MLE. A Poisson model with $\lambda = 2.3$ can also be seen to be a good fit
($H_0$ is not rejected), but not better than the cases for $\lambda = 2.4$ and 2.5.


Question. From the above conclusions, how good is this “goodness-of-fit” test? 


Answer. In “goodness-of-fit” tests, the testing procedure defeats the purpose. As per
the testing procedure in hypothesis testing, if $H_0$ is rejected then it is a valid
conclusion, and if $H_0$ is not rejected the procedure does not help to say anything further,
that is, the procedure does not allow you to “accept” $H_0$. The whole purpose of going for a
“goodness-of-fit” test is to claim that the model is a good fit to the data. Hence the
statistical aspect is controversial and it is better to set aside the statistical part. Instead
of relying on the chi-square critical point, one can treat Pearson's $X^2$ as the square of
a distance between the observed and expected frequencies. Hence use the criterion:
“the smaller the distance, the better the model”. By this process, if we want to fit a Poisson
model to the above data and if the three models with $\lambda = 2.3$, $\lambda = 2.4$, $\lambda = 2.5$ are
compared, then we will select the model with $\lambda = 2.5$ because the distance there is the
smallest among the three. One should call this class of tests “lack-of-fit” tests; that is what
is exactly measured by the statistical procedure.


Question. Can we come up with any other discrete distribution, other than a Poisson 
model to fit this data? 


Answer. Yes. We can come up with many other discrete models which will fit this
data. For example, consider a multinomial model where the hypothesized probabilities
are

$p_1 = 0.1$, $p_2 = 0.2$, $p_3 = 0.25$, $p_4 = 0.2$, $p_5 = 0.15$, $p_6 = 0.1$

[the observed proportions]. The distance between the observed and expected frequencies
is exactly zero. Hence there cannot be a better model than this to fit this data.
Consider new multinomial models with slight changes in the above probabilities so that
the $X^2$ value in each case will not reject $H_0$. There can be infinitely many such models
which will all fit the data, even with an $X^2$ value smaller than 2.36, the best among
the three Poisson models that we have considered. Thus, many better fitting models
can be constructed belonging to other families of distributions, or maybe to the same
family.

When Pearson's $X^2$ test or any other so-called “goodness-of-fit” test is used, the
decision of non-rejection cannot be given too much importance. If one model is found 
to be a good fit, under some criterion, we may be able to find several other models 
which are better fits or as good as the selected model, under the same criterion. 
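
The "distance" reading of Pearson's $X^2$ is easy to put into code. The sketch below is an illustration only: the observed frequencies are hypothetical (they are not the data of the worked example in the book), the last cell is taken as "5 or more", and the function names are assumptions.

```python
from math import exp, factorial

def pearson_x2(observed, expected):
    """Pearson's X^2: sum of (observed - expected)^2 / expected over the cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def poisson_expected(lam, n_total, n_cells):
    """Expected frequencies for cells x = 0, 1, ..., n_cells-2 and a last cell
    collecting all larger values, under a Poisson model with parameter lam."""
    probs = [exp(-lam) * lam**x / factorial(x) for x in range(n_cells - 1)]
    probs.append(1.0 - sum(probs))
    return [n_total * p for p in probs]

observed = [9, 22, 26, 21, 13, 9]          # hypothetical frequencies, total 100
n_total = sum(observed)

distances = {lam: pearson_x2(observed, poisson_expected(lam, n_total, len(observed)))
             for lam in (2.3, 2.4, 2.5)}
for lam, d in distances.items():
    print(lam, round(d, 3))
print("smallest distance at lambda =", min(distances, key=distances.get))

# A multinomial model whose probabilities equal the observed proportions has distance 0.
print(pearson_x2(observed, observed))
```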


Question. Do we have to compute the full $X^2$ value [Pearson's $X^2$] to make a decision?




Answer. As illustrated in the example in this book, it is not necessary to compute the
full $X^2$ value. We need to check only whether the observed $X^2$ exceeds the critical point
or not. This may be possible by computing from a few cells.


Question. Can we use this procedure whatever be the number of cells in our classifi- 
cation? 


Answer. No. In order to have a good chi-square approximation for Pearson's $X^2$ statistic,
the following is a rule of thumb. If there is no estimation involved, then the number
of cells $k > 5$ and the expected frequencies in each cell, under $H_0$, must be $\ge 5$. If
estimation of parameters is involved in obtaining the estimated expected frequencies
and if the resulting degrees of freedom is $\nu$, then $\nu - 1 > 5$ and each of the estimated
expected frequencies is $\ge 5$. If the number of cells is only 2, then the
binomial situation arises. Here, the total frequency $n > 20$ for a reasonable approximation
if the true probability is not close to zero or 1. Similarly, for $k = 3, 4$ one can find
separate conditions for a good chi-square approximation.


Question. What does rejection of $H_0$ of no association mean [two-way contingency
table]?


Answer. In a testing procedure, rejection is a logical conclusion. If $H_0$ is not rejected,
then no valid conclusion can be made. Our decision is based on one data set. In
another data set, perhaps the decision may be different unless the original data set is
representative of the whole population in every respect. This type of representation
is not possible in practice. Hence the most that we can say is that the data seem
to be consistent with the hypothesis when $H_0$ is not rejected. Here, our situation is
different. The hypothesis of no association is rejected. Does it not mean that there is 
possibility of association between the characteristics of classification? Since rejection 
is a valid conclusion, we must admit that this data set suggests that there is possibil- 
ity of some sort of association between the characteristics of classification or there is 
possibility of association between weights and intelligence according to this data set 
and according to this testing procedure. 


Question. If the observations are 2,5,6, how many populations are possible from 
where these observations came? 


Answer. There can be infinitely many populations, namely all populations whose range,
with non-zero probabilities, covers these observations.


Question. Suppose that we computed $D_n$ [Kolmogorov-Smirnov statistic for
goodness-of-fit] for one sample with $n = 10$ and got the number 2.3. How many different
populations are possible where an observed sample of size 10 gave the $D_n$ measure
as 2.3? How logical is the statistical procedure in this situation?




Answer. Infinitely many populations are possible. The logical basis is very shaky. If
$H_0$ is rejected, it is well and good. If $H_0$ is not rejected and if we wish to say anything
about the selected model, then the statistical procedure cannot logically support the
move. What one can do is compute $D_n$ or any such distance measure, use the
criterion “the smaller the distance, the better the fit”, and then say that the selected model seems
to be a good fit if the distance is smaller than a preassigned number, remembering that
there could be several other models which may also be good fits to the same data.


Question. Is not the procedure of constructing confidence intervals the same as
testing of hypotheses? One seems to be a complement of the other.


Answer. Some people mix up the two procedures and give credence to the statement
of “accepting $H_0$”, saying that confidence intervals and “acceptance regions” coincide
in many cases. The two procedures are different and the premises are different also. In 
some populations, such as the normal population the test statistics and pivotal quan- 
tities are similar and this aspect may give rise to this doubt. Take the binomial and 
Poisson parameters. Constructing confidence intervals is fully different from testing a 
hypothesis there. Pivotal quantities may not be there for constructing confidence inter- 
vals but under a null hypothesis the populations may be fully known. Testing is based 
on some motivating principle such as the likelihood ratio test, maximizing power, etc. 
whereas for constructing confidence intervals one may select a pivotal quantity arbi- 
trarily, the same pivotal quantity as well as different pivotal quantities giving different 
confidence intervals for the same parameter with the same confidence coefficient. The 
procedure of constructing confidence intervals and testing of hypotheses should not 
be mixed up; these two have different premises. 


Tables of percentage points 


Table 1: Binomial coefficients 


Entry: $\binom{n}{x} = \frac{n!}{x!\,(n-x)!}$


n x=0 1 2 3 4 5 6 7 8 

5 1 5 10 

6 1 6 15 20 

7 1 7 21 35 

8 1 8 28 56 70 

9 1 9 36 84 126 

10 1 10 45 120 210 252 

11 1 11 55 165 330 462 

12 1 12 66 220 495 792 924 

13 1 13 78 286 715 1287 1716 

14 1 14 91 364 1001 2002 3003 3432 

15 1 15 105 455 1365 3003 5005 6435 

16 1 16 120 560 1820 4368 8008 11440 12870 
17 1 17 136 680 2380 6188 12376 19448 24310 
18 1 18 153 816 3060 8568 18564 31824 43758 
19 1 19 171 969 3876 11628 27132 50388 75582 
20 1 20 190 1140 4845 15504 38760 77520 125970 
21 1 21 210 1330 5985 20349 54264 116280 203490 
22 1 22 231 1540 7315 26334 74613 170544 319770 
23 1 23 253 1771 8855 33649 100947 245157 490314 
24 1 24 276 2024 10626 42504 134596 346104 735471 
25 1 25 300 2300 12650 53130 177100 480700 1081575 
26 1 26 325 2600 14950 65780 230230 657800 1562275 
27 1 27 351 2925 17550 80730 296010 888030 2220075 
28 1 28 378 3276 20475 98280 376740 1184040 3108105 
29 1 29 406 3654 23751 118755 475020 1560780 4292145 
30 1 30 435 4060 27405 142506 593775 2035800 5852925 
n x=9 10 11 12 13 14 15 

18 48620 

19 92378 


20 167960 184756 

21 293930 352716 

22 497420 646646 705432 

23 817190 1144066 1352078 

24 1307504 1961256 2496144 2704156 

25 2042975 3268760 4457400 5200300 

26 3124550 5311735 7726160 9657700 10400600 

27 4686825 8436285 13037895 17383860 20058300 

28 6906900 13123110 21474180 30421755 37442160 40116600 
29 10015005 20030010 34597290 51895935 67863915 77558760 
30 14307150 30045015 54627300 86493225 119759850 145422675 155117520 
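
Entries of Table 1 can be regenerated with a few lines of code. The sketch below uses Python's standard `math.comb`; the helper name `table1_row` is hypothetical, and the layout (entries up to $x = \lfloor n/2 \rfloor$, the rest following by symmetry) is an assumption read off from the table.

```python
from math import comb

def table1_row(n):
    """Binomial coefficients C(n, x) for x = 0, ..., floor(n/2);
    the remaining entries follow from C(n, x) = C(n, n - x)."""
    return [comb(n, x) for x in range(n // 2 + 1)]

for n in (5, 10, 21, 28):
    print(n, table1_row(n))
```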


Open Access. © 2018 Arak M. Mathai, Hans J. Haubold, published by De Gruyter. This work is licensed
under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
https://doi.org/10.1515/9783110562545-017




Table 2: Cumulative binomial probabilities 


Entry: $\sum_{r=0}^{x}\binom{n}{r}p^{r}(1-p)^{n-r} = \sum_{r=n-x}^{n}\binom{n}{r}(1-p)^{r}p^{n-r}$

 n  x  p=0.05   0.10   0.15   0.20   0.25   0.30   0.35   0.40   0.45   0.50
 1  0  0.9500 0.9000 0.8500 0.8000 0.7500 0.7000 0.6500 0.6000 0.5500 0.5000
    1  1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000

 2  0  0.9025 0.8100 0.7225 0.6400 0.5625 0.4900 0.4225 0.3600 0.3025 0.2500
    1  0.9975 0.9900 0.9775 0.9600 0.9375 0.9100 0.8775 0.8400 0.7975 0.7500
    2  1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000

 3  0  0.8574 0.7290 0.6141 0.5120 0.4219 0.3430 0.2746 0.2160 0.1664 0.1250
    1  0.9927 0.9720 0.9392 0.8960 0.8437 0.7840 0.7182 0.6480 0.5747 0.5000
    2  0.9999 0.9990 0.9966 0.9920 0.9844 0.9730 0.9571 0.9360 0.9089 0.8750
    3  1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000

 4  0  0.8145 0.6561 0.5220 0.4096 0.3164 0.2401 0.1785 0.1296 0.0915 0.0625
    1  0.9860 0.9477 0.8905 0.8192 0.7383 0.6517 0.5630 0.4752 0.3910 0.3125
    2  0.9995 0.9963 0.9880 0.9728 0.9492 0.9163 0.8735 0.8208 0.7585 0.6875
    3  1.0000 0.9999 0.9995 0.9984 0.9961 0.9919 0.9850 0.9744 0.9590 0.9375
    4  1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000

 5  0  0.7738 0.5905 0.4437 0.3277 0.2373 0.1681 0.1160 0.0778 0.0503 0.0313
    1  0.9774 0.9185 0.8352 0.7373 0.6328 0.5282 0.4284 0.3370 0.2562 0.1875
    2  0.9988 0.9914 0.9734 0.9421 0.8965 0.8369 0.7648 0.6826 0.5931 0.5000
    3  1.0000 0.9995 0.9978 0.9933 0.9844 0.9692 0.9460 0.9130 0.8688 0.8125
    4  1.0000 1.0000 0.9999 0.9997 0.9990 0.9976 0.9947 0.9898 0.9815 0.9688
    5  1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000

 6  0  0.7351 0.5314 0.3771 0.2621 0.1780 0.1176 0.0754 0.0467 0.0277 0.0156
    1  0.9672 0.8857 0.7765 0.6554 0.5339 0.4202 0.3191 0.2333 0.1636 0.1094
    2  0.9978 0.9841 0.9527 0.9011 0.8306 0.7443 0.6471 0.5443 0.4415 0.3438
    3  0.9999 0.9987 0.9941 0.9830 0.9624 0.9295 0.8826 0.8208 0.7447 0.6562
    4  1.0000 0.9999 0.9996 0.9984 0.9954 0.9891 0.9777 0.9590 0.9308 0.8906
    5  1.0000 1.0000 1.0000 0.9999 0.9998 0.9993 0.9982 0.9959 0.9917 0.9844
    6  1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000

 7  0  0.6983 0.4783 0.3206 0.2097 0.1335 0.0824 0.0490 0.0280 0.0152 0.0078
    1  0.9556 0.8503 0.7166 0.5767 0.4449 0.3294 0.2338 0.1586 0.1024 0.0625
    2  0.9962 0.9743 0.9262 0.8520 0.7564 0.6471 0.5323 0.4199 0.3164 0.2266
    3  0.9998 0.9973 0.9879 0.9667 0.9294 0.8740 0.8002 0.7102 0.6083 0.5000
    4  1.0000 0.9998 0.9988 0.9953 0.9871 0.9712 0.9444 0.9037 0.8471 0.7734
    5  1.0000 1.0000 0.9999 0.9996 0.9987 0.9962 0.9910 0.9812 0.9643 0.9375
    6  1.0000 1.0000 1.0000 1.0000 0.9999 0.9998 0.9994 0.9984 0.9963 0.9922
    7  1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000

 8  0  0.6634 0.4305 0.2725 0.1678 0.1001 0.0576 0.0319 0.0168 0.0084 0.0039
    1  0.9428 0.8131 0.6572 0.5033 0.3671 0.2553 0.1691 0.1064 0.0632 0.0352
    2  0.9942 0.9619 0.8948 0.7969 0.6786 0.5518 0.4278 0.3154 0.2201 0.1445
    3  0.9996 0.9950 0.9786 0.9437 0.8862 0.8059 0.7064 0.5941 0.4770 0.3633


Table 2: (continued) 


n 


10 


11 


12 


x 


ο - AnA 


ο ο Νου ο ο » - ο 


OAN DAU ο ο Ν -ᾱ ο 


my 
o 


OANA UU FWNPRP OO 


e e 
ÍÓ o 


FWN FP ο 


p - 0.05 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.6302 
0.9288 
0.9916 
0.9994 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.5987 
0.9130 
0.9885 
0.9990 
0.9999 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.5688 
0.8981 
0.9848 
0.9984 
0.9999 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.5404 
0.8816 
0.9804 
0.9978 
0.9998 


0.10 


0.9996 
1.0000 
1.0000 
1.0000 
1.0000 


0.3874 
0.7748 
0.9470 
0.9917 
0.9991 
0.9999 
1.0000 
1.0000 
1.0000 
1.0000 


0.3487 
0.7361 
0.9298 
0.9872 
0.9984 
0.9999 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.3138 
0.6974 
0.9104 
0.9815 
0.9972 
0.9997 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.2824 
0.6590 
0.8891 
0.9744 
0.9957 


0.15 


0.9971 
0.9998 
1.0000 
1.0000 
1.0000 


0.2316 
0.5995 
0.8591 
0.9661 
0.9944 
0.9994 
1.0000 
1.0000 
1.0000 
1.0000 


0.1969 
0.5443 
0.8202 
0.9500 
0.9901 
0.9986 
0.9999 
1.0000 
1.0000 
1.0000 
1.0000 


0.1673 
0.4922 
0.7788 
0.9306 
0.9841 
0.9973 
0.9997 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.1422 
0.4435 
0.7358 
0.9078 
0.9761 


0.20 


0.9896 
0.9988 
0.9999 
1.0000 
1.0000 


0.1342 
0.4362 
0.7382 
0.9144 
0.9804 
0.9969 
0.9997 
1.0000 
1.0000 
1.0000 


0.1074 
0.3758 
0.6778 
0.8791 
0.9672 
0.9936 
0.9991 
0.9999 
1.0000 
1.0000 
1.0000 


0.0859 
0.3221 
0.6174 
0.8389 
0.9496 
0.9883 
0.9980 
0.9998 
1.0000 
1.0000 
1.0000 
1.0000 


0.0687 
0.2749 
0.5583 
0.7946 
0.9274 


0.25 


0.9727 
0.9958 
0.9996 
1.0000 
1.0000 


0.0751 
0.3003 
0.6007 
0.8343 
0.9511 
0.9900 
0.9987 
0.9999 
1.0000 
1.0000 


0.0563 
0.2440 
0.5256 
0.7759 
0.9219 
0.9803 
0.9965 
0.9996 
1.0000 
1.0000 
1.0000 


0.0422
0.1971 
0.4552 
0.7133 
0.8854 
0.9657 
0.9924 
0.9988 
0.9999 
1.0000 
1.0000 
1.0000 


0.0317 
0.1584 
0.3907 
0.6488 
0.8424 




0.30 


0.9420 
0.9887 
0.9987 
0.9999 
1.0000 


0.0404 
0.1960 
0.4628 
0.7297 
0.9012 
0.9747 
0.9957 
0.9996 
1.0000 
1.0000 


0.0282 
0.1493 
0.3828 
0.6496 
0.8497 
0.9527 
0.9894 
0.9984 
0.9999 
1.0000 
1.0000 


0.0198
0.1130 
0.3127 
0.5696 
0.7897 
0.9218 
0.9784 
0.9957 
0.9994 
1.0000 
1.0000 
1.0000 


0.0138 
0.0850 
0.2528 
0.4925 
0.7237 


0.35 


0.8939 
0.9747 
0.9964 
0.9998 
1.0000 


0.0207 
0.1211 
0.3373 
0.6089 
0.8283 
0.9464 
0.9888 
0.9986 
0.9999 
1.0000 


0.0135 
0.0860 
0.2616 
0.5138 
0.7515 
0.9051 
0.9740 
0.9952 
0.9995 
1.0000 
1.0000 


0.0088
0.0606 
0.2001 
0.4256 
0.6683 
0.8513 
0.9499 
0.9878 
0.9980 
0.9998 
1.0000 
1.0000 


0.0057 
0.0424 
0.1513 
0.3467 
0.5833 


0.40 


0.8263 
0.9502 
0.9915 
0.9993 
1.0000 


0.0101 
0.0705 
0.2318 
0.4826 
0.7334 
0.9006 
0.9750 
0.9962 
0.9997 
1.0000 


0.0060 
0.0464 
0.1673 
0.3823 
0.6331 
0.8338 
0.9452 
0.9877 
0.9983 
0.9999 
1.0000 


0.0036
0.0302 
0.1189 
0.2963 
0.5328 
0.7535 
0.9006 
0.9707 
0.9941 
0.9993 
1.0000 
1.0000 


0.0022 
0.0196 
0.0834 
0.2253 
0.4382 


0.45 


0.7396 
0.9115 
0.9819 
0.9983 
1.0000 


0.0046 
0.0385 
0.1495 
0.3614 
0.6214 
0.8342 
0.9502 
0.9909 
0.9992 
1.0000 


0.0025 
0.0233 
0.0996 
0.2660 
0.5044 
0.7384 
0.8980 
0.9726 
0.9955 
0.9997 
1.0000 


0.0014
0.0139 
0.0652 
0.1911 
0.3971 
0.6331 
0.8262 
0.9390 
0.9852 
0.9978 
0.9998 
1.0000 


0.0008 
0.0083 
0.0421 
0.1345 
0.3044 


0.50 


0.6367 
0.8555 
0.9648 
0.9961 
1.0000 


0.0020 
0.0195 
0.0898 
0.2539 
0.5000 
0.7461 
0.9102 
0.9805 
0.9980 
1.0000 


0.0010 
0.0107 
0.0547 
0.1719
0.3770 
0.6230 
0.8281 
0.9453 
0.9893
0.9990 
1.0000 


0.0005 
0.0059 
0.0327 
0.1133 
0.2744 
0.5000 
0.7256 
0.8867 
0.9673 
0.9941 
0.9995 
1.0000 


0.0002 
0.0032 
0.0193 
0.0730 
0.1938 




Table 2: (continued) 


n: 12 (continued), 13, 14, 15


x: 5 to 12 for n = 12; 0 to n for n = 13, 14; 0 to 5 for n = 15


p=0.05 


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.5133 
0.8646 
0.9755 
0.9969 
0.9997 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.4877 
0.8470 
0.9699 
0.9958 
0.9996 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.4633 
0.8290 
0.9638 
0.9945 
0.9994 
0.9999 


0.10 


0.9995 
0.9999 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.2543 
0.6213 
0.8661 
0.9658 
0.9935 
0.9991 
0.9999 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.2288 
0.5846 
0.8416 
0.9559 
0.9908 
0.9985 
0.9998 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.2059 
0.5490 
0.8159 
0.9444 
0.9873 
0.9978 


0.15 


0.9954 
0.9993 
0.9999 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.1209 
0.3983 
0.6920 
0.8820 
0.9658 
0.9925 
0.9987 
0.9998 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.1028 
0.3567 
0.6479 
0.8535 
0.9533 
0.9885 
0.9978 
0.9997 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.0874 
0.3186 
0.6042 
0.8227 
0.9383
0.9832 


0.20 


0.9806 
0.9961 
0.9994 
0.9999 
1.0000 
1.0000 
1.0000 
1.0000 


0.0550 
0.2336 
0.5017 
0.7473 
0.9009 
0.9700 
0.9930 
0.9988 
0.9998 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.0440 
0.1979 
0.4481 
0.6982 
0.8702 
0.9561 
0.9884 
0.9976 
0.9996 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.0352 
0.1671 
0.3980 
0.6482 
0.8358
0.9389 


0.25 


0.9456 
0.9857 
0.9972 
0.9996 
1.0000 
1.0000 
1.0000 
1.0000 


0.0238 
0.1267 
0.3326 
0.5843 
0.7940 
0.9198 
0.9757 
0.9944 
0.9990 
0.9999 
1.0000 
1.0000 
1.0000 
1.0000 


0.0178 
0.1010 
0.2811 
0.5213 
0.7415 
0.8883 
0.9617 
0.9897 
0.9978 
0.9997 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.0134 
0.0802 
0.2361 
0.4613 
0.6865
0.8516 


0.30 


0.8822 
0.9614 
0.9905 
0.9983 
0.9998 
1.0000 
1.0000 
1.0000 


0.0097 
0.0637 
0.2025 
0.4206 
0.6543 
0.8346 
0.9376 
0.9818 
0.9960 
0.9993 
0.9999 
1.0000 
1.0000 
1.0000 


0.0068 
0.0475 
0.1608 
0.3552 
0.5842 
0.7805 
0.9067
0.9685 
0.9917 
0.9983 
0.9998 
1.0000 
1.0000 
1.0000 
1.0000 


0.0047 
0.0353 
0.1268 
0.2969 
0.5155
0.7216 


0.35 


0.7873 
0.9154 
0.9745 
0.9944 
0.9992 
0.9999 
1.0000 
1.0000 


0.0037 
0.0296 
0.1132 
0.2783 
0.5005 
0.7159 
0.8705 
0.9538 
0.9874 
0.9975 
0.9997 
1.0000 
1.0000 
1.0000 


0.0024 
0.0205 
0.0839 
0.2205 
0.4227 
0.6405 
0.8164 
0.9247 
0.9757 
0.9940 
0.9989 
0.9999 
1.0000 
1.0000 
1.0000 


0.0016 
0.0142 
0.0617 
0.1727 
0.3519 
0.5643 


0.40 


0.6652 
0.8418 
0.9427 
0.9847 
0.9972 
0.9997 
1.0000 
1.0000 


0.0013 
0.0126 
0.0579
0.1686 
0.3530 
0.5744 
0.7712 
0.9023 
0.9679 
0.9922 
0.9987 
0.9999 
1.0000 
1.0000 


0.0008 
0.0081 
0.0398 
0.1243 
0.2793 
0.4859 
0.6925 
0.8499 
0.9417 
0.9825 
0.9961 
0.9994 
0.9999 
1.0000 
1.0000 


0.0005 
0.0052 
0.0271 
0.0905 
0.2173 
0.4032 


0.45 


0.5269 
0.7393 
0.8883 
0.9644 
0.9921 
0.9989 
0.9999 
1.0000 


0.0004 
0.0049 
0.0269 
0.0929 
0.2279 
0.4268 
0.6437 
0.8212 
0.9302 
0.9797 
0.9959 
0.9995 
1.0000 
1.0000 


0.0002 
0.0029 
0.0170 
0.0632 
0.1672 
0.3373 
0.5461 
0.7414 
0.8811 
0.9574 
0.9886 
0.9978 
0.9997 
1.0000 
1.0000 


0.0001 
0.0017 
0.0107 
0.0424 
0.1204 
0.2608 


0.50 


0.3872 
0.6128 
0.8062 
0.9270 
0.9807 
0.9968 
0.9998 
1.0000 


0.0001 
0.0017 
0.0112 
0.0461 
0.1334 
0.2905 
0.5000 
0.7095 
0.8666 
0.9539 
0.9888 
0.9983 
0.9999 
1.0000 


0.0001 
0.0009 
0.0065 
0.0287 
0.0898 
0.2120 
0.3953 
0.6047 
0.7880 
0.9102 
0.9713 
0.9935 
0.9991 
0.9999 
1.0000 


0.0000 
0.0005 
0.0037 
0.0176 
0.0592 
0.1509 


Table 2: (continued) 


n: 15 (continued), 16, 17


x: 6 to 14 for n = 15; 0 to 15 for n = 16; 0 to 16 for n = 17


p=0.05


1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.4401 
0.8108 
0.9571 
0.9930 
0.9991 
0.9999 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.4181 
0.7922 
0.9497 
0.9912 
0.9988 
0.9999 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.10 


0.9997 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.1853 
0.5147 
0.7892 
0.9316 
0.9830 
0.9967 
0.9995 
0.9999 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.1668 
0.4818 
0.7618 
0.9174 
0.9779 
0.9953 
0.9992 
0.9999 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.15 


0.9964 
0.9994 
0.9999 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.0743 
0.2839 
0.5614 
0.7899 
0.9209 
0.9765 
0.9944 
0.9989 
0.9998 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.0631 
0.2525 
0.5198 
0.7556 
0.9013 
0.9681 
0.9917 
0.9983 
0.9997 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.20 


0.9819 
0.9958 
0.9992 
0.9999 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.0281 
0.1407 
0.3518 
0.5981 
0.7982 
0.9183 
0.9733 
0.9930 
0.9985 
0.9998 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.0225 
0.1182 
0.3096 
0.5489 
0.7582 
0.8943 
0.9623 
0.9891 
0.9974 
0.9995 
0.9999 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.25 


0.9434 
0.9827 
0.9958 
0.9992 
0.9999 
1.0000 
1.0000 
1.0000 
1.0000 


0.0100 
0.0635 
0.1971 
0.4050 
0.6302 
0.8103 
0.9204 
0.9729 
0.9925 
0.9984 
0.9997 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.0075 
0.0501 
0.1637 
0.3530 
0.5739 
0.7653 
0.8929 
0.9598 
0.9876 
0.9969 
0.9994 
0.9999 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 




0.30 


0.8689 
0.9500 
0.9848
0.9963 
0.9993 
0.9999 
1.0000 
1.0000 
1.0000 


0.0033 
0.0261 
0.0994 
0.2459 
0.4499 
0.6598 
0.8247 
0.9256 
0.9743 
0.9929 
0.9984 
0.9997 
1.0000 
1.0000 
1.0000 
1.0000 


0.0023 
0.0193 
0.0774 
0.2019 
0.3887 
0.5968 
0.7752 
0.8954 
0.9597 
0.9873 
0.9968 
0.9993 
0.9999 
1.0000 
1.0000 
1.0000 
1.0000 


0.35 


0.7548 
0.8868 
0.9578 
0.9876 
0.9972 
0.9995 
0.9999 
1.0000 
1.0000 


0.0010 
0.0098 
0.0451 
0.1339 
0.2892 
0.4900 
0.6881 
0.8406 
0.9329 
0.9771 
0.9938 
0.9987 
0.9998 
1.0000 
1.0000 
1.0000 


0.0007 
0.0067 
0.0327 
0.1028 
0.2348 
0.4197 
0.6188 
0.7872 
0.9006 
0.9617 
0.9880 
0.9970 
0.9994 
0.9999 
1.0000 
1.0000 
1.0000 


0.40 


0.6098 
0.7869 
0.9050 
0.9662 
0.9907 
0.9981 
0.9997 
1.0000 
1.0000 


0.0003 
0.0033 
0.0183 
0.0651 
0.1666 
0.3288 
0.5272 
0.7161 
0.8577 
0.9417 
0.9809 
0.9951
0.9991 
0.9999 
1.0000 
1.0000 


0.0002 
0.0021 
0.0123 
0.0464 
0.1260 
0.2639 
0.4478 
0.6405
0.8011 
0.9081 
0.9652 
0.9894 
0.9975 
0.9995 
0.9999 
1.0000 
1.0000 


0.45 


0.4522 
0.6535 
0.8182 
0.9231 
0.9745 
0.9937 
0.9989 
0.9999 
1.0000 


0.0001 
0.0010 
0.0066 
0.0281 
0.0853 
0.1976 
0.3660 
0.5629 
0.7441 
0.8759
0.9514 
0.9851 
0.9965 
0.9994 
0.9999 
1.0000 


0.0000 
0.0006 
0.0041 
0.0184 
0.0596 
0.1471 
0.2902 
0.4743 
0.6626 
0.8166 
0.9174 
0.9699 
0.9914 
0.9981 
0.9997 
1.0000 
1.0000 


0.50 


0.3036 
0.5000 
0.6964 
0.8491 
0.9408 
0.9824 
0.9963 
0.9995 
1.0000 


0.0000 
0.0003 
0.0021 
0.0106 
0.0384 
0.1051 
0.2272 
0.4018 
0.5982 
0.7728 
0.8949 
0.9616 
0.9894 
0.9979 
0.9997 
1.0000 


0.0000 
0.0001 
0.0012 
0.0064 
0.0245 
0.0717 
0.1662 
0.3145 
0.5000 
0.6855 
0.8338 
0.9283 
0.9755 
0.9936 
0.9988 
0.9999 
1.0000 




Table 2: (continued) 


n: 18, 19, 20


x: 0 to 17 for n = 18 and n = 19; 0 to 6 for n = 20


p=0.05 


0.3972 
0.7735 
0.9419 
0.9891 
0.9985 
0.9998 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.3774 
0.7547 
0.9335 
0.9868 
0.9980 
0.9998 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.3585 
0.7358 
0.9245 
0.9841 
0.9974 
0.9997 
1.0000 


0.10 


0.1501 
0.4503 
0.7338 
0.9018 
0.9718 
0.9936 
0.9988 
0.9998 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.1351 
0.4203 
0.7054 
0.8850 
0.9648
0.9914 
0.9983 
0.9997 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.1216 
0.3917 
0.6769 
0.8670 
0.9568 
0.9887 
0.9976 


0.15 


0.0536 
0.2241 
0.4797 
0.7202 
0.8794 
0.9581 
0.9882 
0.9973 
0.9995 
0.9999 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.0456 
0.1985 
0.4413 
0.6841 
0.8556 
0.9463 
0.9837 
0.9959 
0.9992 
0.9999 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.0388 
0.1756 
0.4049 
0.6477 
0.8298 
0.9327 
0.9781 


0.20 


0.0180 
0.0991 
0.2713 
0.5010 
0.7164 
0.8671 
0.9487 
0.9837 
0.9957 
0.9991 
0.9998 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.0144 
0.0829 
0.2369 
0.4551 
0.6733 
0.8369 
0.9324 
0.9767 
0.9933 
0.9984 
0.9997 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.0115 
0.0692 
0.2061 
0.4114 
0.6296 
0.8042 
0.9133 


0.25 


0.0056 
0.0395 
0.1353 
0.3057 
0.5187 
0.7175 
0.8610 
0.9431 
0.9807 
0.9946 
0.9988 
0.9998 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.0042 
0.0310 
0.1113 
0.2631 
0.4654 
0.6678 
0.8251 
0.9225 
0.9713 
0.9911 
0.9977 
0.9995 
0.9999 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.0032 
0.0243 
0.0913 
0.2252 
0.4148 
0.6172 
0.7858 


0.30 


0.0016 
0.0142 
0.0600 
0.1646 
0.3327 
0.5344 
0.7217 
0.8593 
0.9404 
0.9790 
0.9939 
0.9986 
0.9997 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.0011 
0.0104 
0.0462 
0.1332 
0.2822 
0.4739 
0.6655 
0.8180 
0.9161 
0.9674 
0.9895 
0.9972 
0.9994 
0.9999 
1.0000 
1.0000 
1.0000 
1.0000 


0.0008 
0.0076 
0.0355 
0.1071 
0.2375 
0.4164 
0.6080 


0.35 


0.0004 
0.0046 
0.0236 
0.0783 
0.1886 
0.3550 
0.5491 
0.7283 
0.8609 
0.9403 
0.9788 
0.9938 
0.9986
0.9997 
1.0000 
1.0000 
1.0000 
1.0000 


0.0003 
0.0031 
0.0170 
0.0591 
0.1500 
0.2968 
0.4812 
0.6656 
0.8145 
0.9125 
0.9653 
0.9886 
0.9969 
0.9993 
0.9999 
1.0000 
1.0000 
1.0000 


0.0002 
0.0021 
0.0121 
0.0444 
0.1182 
0.2454 
0.4166 


0.40 


0.0001 
0.0013 
0.0082 
0.0328 
0.0942 
0.2088 
0.3743 
0.5634 
0.7368 
0.8653 
0.9424 
0.9797 
0.9942 
0.9987 
0.9998 
1.0000 
1.0000 
1.0000 


0.0001 
0.0008 
0.0055 
0.0230 
0.0696 
0.1629 
0.3081 
0.4878 
0.6675 
0.8139 
0.9115 
0.9648 
0.9884 
0.9969 
0.9994 
0.9999 
1.0000 
1.0000 


0.0000 
0.0005 
0.0036 
0.0160 
0.0510 
0.1256 
0.2500 


0.45 


0.0000 
0.0003 
0.0025 
0.0120 
0.0411 
0.1077 
0.2258 
0.3915 
0.5778 
0.7473 
0.8720 
0.9463 
0.9817 
0.9951 
0.9990 
0.9999 
1.0000 
1.0000 


0.0000 
0.0002 
0.0015 
0.0077 
0.0280 
0.0777 
0.1727 
0.3169 
0.4940 
0.6710 
0.8159 
0.9129 
0.9658 
0.9891 
0.9972 
0.9995 
0.9999 
1.0000 


0.0000 
0.0001 
0.0009 
0.0049 
0.0189 
0.0553 
0.1299 


0.50 


0.0000 
0.0001 
0.0007 
0.0038 
0.0154 
0.0481 
0.1189 
0.2403 
0.4073 
0.5927 
0.7597 
0.8811 
0.9519 
0.9846 
0.9962 
0.9993 
0.9999 
1.0000 


0.0000 
0.0000 
0.0004 
0.0022 
0.0096 
0.0318 
0.0835 
0.1796 
0.3238 
0.5000 
0.6762 
0.8204 
0.9165 
0.9682 
0.9904 
0.9978 
0.9996 
1.0000 


0.0000 
0.0000 
0.0002 
0.0013 
0.0059 
0.0207 
0.0577 


Table 2: (continued) 


n x p=0.05 0.10 0.15 0.20 0.25


20 7 1.0000 0.9996 0.9941 0.9679 0.8982
20 8 1.0000 0.9999 0.9987 0.9900 0.9591
20 9 1.0000 1.0000 0.9998 0.9974 0.9861
20 10 1.0000 1.0000 1.0000 0.9994 0.9961
20 11 1.0000 1.0000 1.0000 0.9999 0.9991
20 12 1.0000 1.0000 1.0000 1.0000 0.9998
20 13 1.0000 1.0000 1.0000 1.0000 1.0000
20 14 1.0000 1.0000 1.0000 1.0000 1.0000
20 15 1.0000 1.0000 1.0000 1.0000 1.0000
20 16 1.0000 1.0000 1.0000 1.0000 1.0000
20 17 1.0000 1.0000 1.0000 1.0000 1.0000
20 18 1.0000 1.0000 1.0000 1.0000 1.0000


Table 3: Cumulative Poisson probabilities


Entry: $\sum_{r=0}^{x} \frac{\lambda^{r}}{r!} e^{-\lambda}$
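Each entry of Table 3 is a cumulative Poisson probability, P(X ≤ x) for a Poisson variable X with parameter λ. As a cross-check on individual entries, the sum can be evaluated directly. The following is a minimal Python sketch under that reading of the table, not the authors' procedure; the function name `poisson_cdf` and its argument names are hypothetical.

```python
from math import exp


def poisson_cdf(x: int, lam: float) -> float:
    """Return P(X <= x) for X ~ Poisson(lam), summing the terms r = 0..x."""
    term = exp(-lam)          # r = 0 term: e^{-lam}
    total = term
    for r in range(1, x + 1):
        term *= lam / r       # lam^r / r! * e^{-lam}, built from the previous term
        total += term
    return total


if __name__ == "__main__":
    # Entry for lambda = 2.0, x = 3
    print(f"{poisson_cdf(3, 2.0):.4f}")  # prints 0.8571
```

Building each term from the previous one avoids computing large factorials and powers separately, which keeps the sketch stable for the range of λ covered here.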




0.30 


0.7723 
0.8867 
0.9520 
0.9829 
0.9949 
0.9987 
0.9997 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


0.35 


0.6010 
0.7624 
0.8782 
0.9468 
0.9804 
0.9940 
0.9985 
0.9997 
0.9999 
1.0000 
1.0000 
1.0000 


0.40 


0.4159 
0.5956 
0.7553 
0.8725 
0.9435 
0.9790 
0.9935 
0.9984 
0.9997 
1.0000 
1.0000 
1.0000 


0.45 


0.2520 
0.4143 
0.5914 
0.7507 
0.8692 
0.9420 
0.9786 
0.9936 
0.9985 
0.9997 
1.0000 
1.0000 


0.50 


0.1316 
0.2517 
0.4119 
0.5881 
0.7483 
0.8684 
0.9423 
0.9793 
0.9941 
0.9987 
0.9998 
1.0000 
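The entries of Table 2 can be recomputed in the same spirit. The following is a minimal Python sketch, assuming each entry is the cumulative binomial probability P(X ≤ x) for X binomial with parameters n and p; the function name `binomial_cdf` is hypothetical and this is not presented as the method used to build the table.

```python
from math import comb


def binomial_cdf(x: int, n: int, p: float) -> float:
    """Return P(X <= x) for X ~ Binomial(n, p) by direct summation."""
    q = 1.0 - p
    return sum(comb(n, r) * p**r * q**(n - r) for r in range(x + 1))


if __name__ == "__main__":
    # Entry for n = 10, p = 0.50, x = 5
    print(f"{binomial_cdf(5, 10, 0.50):.4f}")  # prints 0.6230
```

Direct summation is adequate for n up to 20 as tabulated; for much larger n a library routine would be the more careful choice.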


x λ=0.1 0.2 0.3 0.4


0 0.9048 0.8187 0.7408 0.6703
1 0.9953 0.9825 0.9631 0.9384
2 0.9998 0.9989 0.9964 0.9921
3 1.0000 0.9999 0.9997 0.9992
4 1.0000 1.0000 1.0000 0.9999
5 1.0000 1.0000 1.0000 1.0000
6 1.0000 1.0000 1.0000 1.0000
7 1.0000 1.0000 1.0000 1.0000


x λ=1.1 1.2 1.3 1.4


0 0.3329 0.3012 0.2725 0.2466
1 0.6990 0.6626 0.6268 0.5918
2 0.9004 0.8795 0.8571 0.8335
3 0.9743 0.9662 0.9569 0.9463
4 0.9946 0.9923 0.9893 0.9857
5 0.9990 0.9985 0.9978 0.9968
6 0.9999 0.9997 0.9996 0.9994
7 1.0000 1.0000 0.9999 0.9999
8 1.0000 1.0000 1.0000 1.0000
9 1.0000 1.0000 1.0000 1.0000


0.5 


0.6065 
0.9098 
0.9856 
0.9982 
0.9998 
1.0000 
1.0000 
1.0000 


1.5 


0.2231 
0.5578 
0.8088 
0.9344 
0.9814 
0.9955 
0.9991 
0.9998 
1.0000 
1.0000 


0.6 


0.5488 
0.8781 
0.9769 
0.9966 
0.9996 
1.0000 
1.0000 
1.0000 


1.6 


0.2019 
0.5249 
0.7834 
0.9212 
0.9763 
0.9940 
0.9987 
0.9997 
1.0000 
1.0000 


0.7 


0.4966 
0.8442 
0.9659 
0.9942 
0.9992 
0.9999 
1.0000 
1.0000 


1.7 


0.1827 
0.4932 
0.7572 
0.9068 
0.9704 
0.9920 
0.9981 
0.9996 
0.9999 
1.0000 


0.8 


0.4493 
0.8088 
0.9526 
0.9909 
0.9986 
0.9998 
1.0000 
1.0000 


1.8 


0.1653 
0.4628 
0.7306 
0.8913 
0.9636 
0.9896
0.9974 
0.9994 
0.9999 
1.0000 


0.9 


0.4066 
0.7725 
0.9371 
0.9865 
0.9977 
0.9997 
1.0000 
1.0000 


1.9 


0.1496 
0.4337 
0.7037 
0.8747 
0.9559 
0.9868 
0.9966 
0.9992 
0.9998 
1.0000 


1.0 


0.3679 
0.7358 
0.9197 
0.9810 
0.9963 
0.9994 
0.9999 
1.0000 


2.0 


0.1353 
0.4060 
0.6767 
0.8571 
0.9473 
0.9834 
0.9955 
0.9989 
0.9998 
1.0000 




Table 3: (continued) 


x λ=2.1


0 0.1225
1 0.3796
2 0.6496
3 0.8386
4 0.9379
5 0.9796
6 0.9941
7 0.9985
8 0.9997
9 0.9999
10 1.0000
11 1.0000
12 1.0000


x λ=3.1


0 0.0450
1 0.1847
2 0.4012
3 0.6248
4 0.7982
5 0.9057
6 0.9612
7 0.9858
8 0.9953
9 0.9986
10 0.9996
11 0.9999
12 1.0000
13 1.0000
14 1.0000


x λ=4.1


0 0.0166
1 0.0845
2 0.2238
3 0.4142
4 0.6093
5 0.7693
6 0.8786
7 0.9427
8 0.9755


2.2 


0.1108 
0.3546 
0.6227 
0.8194 
0.9275 
0.9751 
0.9925 
0.9980 
0.9995 
0.9999 
1.0000 
1.0000 
1.0000 


3.2 


0.0408 
0.1712 
0.3799 
0.6025 
0.7806 
0.8946 
0.9554 
0.9832 
0.9943 
0.9982 
0.9995 
0.9999 
1.0000 
1.0000 
1.0000 


4.2 


0.0150 
0.0780 
0.2102 
0.3954 
0.5898 
0.7531 
0.8675 
0.9361 
0.9721 


2.3 


0.1003 
0.3309 
0.5960 
0.7993 
0.9162 
0.9700 
0.9906 
0.9974 
0.9994 
0.9999 
1.0000 
1.0000 
1.0000 


3.3 


0.0369 
0.1586 
0.3594 
0.5803 
0.7626 
0.8829 
0.9490 
0.9802 
0.9931 
0.9978 
0.9994 
0.9998 
1.0000 
1.0000 
1.0000 


4.3 


0.0136 
0.0719 
0.1974 
0.3772 
0.5704 
0.7367 
0.8558 
0.9290 
0.9683 


2.4 


0.0907 
0.3084 
0.5697
0.7787 
0.9041 
0.9643 
0.9884 
0.9967 
0.9991 
0.9998 
1.0000 
1.0000 
1.0000 


3.4 


0.0334 
0.1468 
0.3397 
0.5584 
0.7442 
0.8705 
0.9421 
0.9769 
0.9917 
0.9973 
0.9992 
0.9998 
0.9999 
1.0000 
1.0000 


4.4 


0.0123 
0.0663 
0.1851 
0.3594 
0.5512 
0.7199
0.8436 
0.9214 
0.9642 


2.5 


0.0821 
0.2873 
0.5438 
0.7576 
0.8912 
0.9580 
0.9858 
0.9958 
0.9989 
0.9997 
0.9999 
1.0000 
1.0000 


3.5 


0.0302 
0.1359 
0.3208 
0.5366 
0.7254 
0.8576 
0.9347
0.9733 
0.9901 
0.9967 
0.9990 
0.9997 
0.9999 
1.0000 
1.0000 


4.5 


0.0111 
0.0611 
0.1736 
0.3423 
0.5321 
0.7029 
0.8311 
0.9134 
0.9597 


2.6 


0.0743 
0.2674 
0.5184 
0.7360 
0.8774 
0.9510 
0.9828 
0.9947 
0.9985 
0.9996 
0.9999 
1.0000 
1.0000 


3.6 


0.0273 
0.1257 
0.3027 
0.5152 
0.7064 
0.8441 
0.9267 
0.9692 
0.9883 
0.9960 
0.9987 
0.9996 
0.9999 
1.0000 
1.0000 


4.6 


0.0101 
0.0563 
0.1626 
0.3257 
0.5132 
0.6858 
0.8180 
0.9049 
0.9549 


2.7 


0.0672 
0.2487 
0.4936
0.7141 
0.8629 
0.9433 
0.9794 
0.9934 
0.9981 
0.9995 
0.9999 
1.0000 
1.0000 


3.7 


0.0247 
0.1162 
0.2854 
0.4942 
0.6872 
0.8301 
0.9182 
0.9648 
0.9863 
0.9952 
0.9984 
0.9995 
0.9999 
1.0000 
1.0000 


4.7 


0.0091 
0.0518 
0.1523 
0.3097 
0.4946 
0.6684 
0.8046 
0.8960 
0.9497 


2.8 


0.0608 
0.2311 
0.4695 
0.6919 
0.8477 
0.9349 
0.9756 
0.9919 
0.9976 
0.9993 
0.9998 
1.0000 
1.0000 


3.8 


0.0224 
0.1074 
0.2689 
0.4735 
0.6678 
0.8156 
0.9091 
0.9599 
0.9840 
0.9942 
0.9981 
0.9994 
0.9998 
1.0000 
1.0000 


4.8 


0.0082 
0.0477 
0.1425 
0.2942 
0.4763 
0.6510 
0.7908 
0.8867 
0.9442 


2.9 


0.0550 
0.2146 
0.4460 
0.6696 
0.8318 
0.9258 
0.9713 
0.9901 
0.9969 
0.9991 
0.9998 
0.9999 
1.0000 


3.9 


0.0202 
0.0992 
0.2531 
0.4532 
0.6484 
0.8006 
0.8995 
0.9546 
0.9815 
0.9931 
0.9977 
0.9993 
0.9998 
0.9999 
1.0000 


4.9 


0.0074 
0.0439 
0.1333 
0.2793 
0.4582 
0.6335 
0.7767 
0.8769 
0.9382 


3.0 


0.0498 
0.1991 
0.4232 
0.6472 
0.8153 
0.9161 
0.9665 
0.9881 
0.9962 
0.9989 
0.9997 
0.9999 
1.0000 


4.0 


0.0183 
0.0916 
0.2381 
0.4335 
0.6288 
0.7851 
0.8893 
0.9489 
0.9786 
0.9919 
0.9972 
0.9991 
0.9997 
0.9999 
1.0000 


5.0 


0.0067 
0.0404 
0.1247 
0.2650 
0.4405 
0.6160 
0.7622 
0.8666
0.9319 


Table 3: (continued) 


x λ=4.1


9 0.9905 
10 0.9966 
11 0.9989 
12 0.9997 
13 0.9999 
14 1.0000 
15 1.0000 


x λ=5.1


0 0.0061
1 0.0372
2 0.1165
3 0.2513
4 0.4231
5 0.5984
6 0.7474
7 0.8560
8 0.9252
9 0.9644
10 0.9844
11 0.9937
12 0.9976
13 0.9992
14 0.9997
15 0.9999
16 1.0000
17 1.0000


x λ=6.1


0 0.0022
1 0.0159
2 0.0577
3 0.1425
4 0.2719
5 0.4298
6 0.5902
7 0.7301
8 0.8367
9 0.9090
10 0.9531
11 0.9776


4.2 


0.9889 
0.9959 
0.9986 
0.9996 
0.9999 
1.0000 
1.0000 


5.2 


0.0055 
0.0342 
0.1088 
0.2381 
0.4061 
0.5809 
0.7324 
0.8449 
0.9181 
0.9603 
0.9823 
0.9927 
0.9972 
0.9990 
0.9997 
0.9999 
1.0000 
1.0000 


6.2 


0.0020 
0.0146 
0.0536 
0.1342 
0.2592 
0.4141 
0.5742 
0.7160 
0.8259 
0.9016 
0.9486 
0.9750 


4.3 


0.9871 
0.9952 
0.9983 
0.9995 
0.9998 
1.0000 
1.0000 


5.3 


0.0050 
0.0314 
0.1016 
0.2254 
0.3895 
0.5635 
0.7171 
0.8335 
0.9106 
0.9559 
0.9800 
0.9916 
0.9967 
0.9988 
0.9996 
0.9999 
1.0000 
1.0000 


6.3 


0.0018 
0.0134 
0.0498 
0.1264 
0.2469 
0.3988 
0.5582 
0.7017 
0.8148 
0.8939 
0.9437 
0.9723 


4.4 


0.9851 
0.9943 
0.9980 
0.9993 
0.9998 
0.9999 
1.0000 


5.4 


0.0045 
0.0289
0.0948 
0.2133 
0.3733 
0.5461 
0.7017 
0.8217 
0.9027 
0.9512 
0.9775 
0.9904 
0.9962 
0.9986 
0.9995 
0.9998 
0.9999 
1.0000 


6.4 


0.0017 
0.0123 
0.0463 
0.1189 
0.2351 
0.3837 
0.5423 
0.6873 
0.8033 
0.8858 
0.9386 
0.9693 


4.5 


0.9829 
0.9933 
0.9976 
0.9992 
0.9997 
0.9999 
1.0000 


5.5 


0.0041 
0.0266
0.0884 
0.2017
0.3575 
0.5289 
0.6860 
0.8095 
0.8944 
0.9462 
0.9747 
0.9890 
0.9955 
0.9983 
0.9994 
0.9998 
0.9999 
1.0000 


6.5 


0.0015 
0.0113 
0.0430 
0.1118 
0.2237 
0.3690 
0.5265 
0.6728 
0.7916 
0.8774 
0.9332 
0.9661 




4.6 


0.9805 
0.9922 
0.9971 
0.9990 
0.9997 
0.9999 
1.0000 


5.6 


0.0037 
0.0244 
0.0824 
0.1906 
0.3422 
0.5119 
0.6703 
0.7970 
0.8857 
0.9409 
0.9718 
0.9875 
0.9949 
0.9980 
0.9993 
0.9998 
0.9999 
1.0000 


6.6 


0.0014 
0.0103 
0.0400 
0.1052 
0.2127 
0.3547 
0.5108 
0.6581 
0.7796 
0.8686 
0.9274 
0.9627 


4.7 


0.9778 
0.9910 
0.9966 
0.9988 
0.9996 
0.9999 
1.0000 


5.7 


0.0033 
0.0224 
0.0768 
0.1800 
0.3272 
0.4950 
0.6544 
0.7841 
0.8766 
0.9352 
0.9686 
0.9859 
0.9941 
0.9977 
0.9991 
0.9997 
0.9999 
1.0000 


6.7 


0.0012 
0.0095 
0.0371 
0.0988 
0.2022 
0.3406 
0.4953 
0.6433 
0.7673 
0.8596 
0.9214 
0.9591 


4.8 


0.9749 
0.9896 
0.9960 
0.9986 
0.9995 
0.9999 
1.0000 


5.8 


0.0030 
0.0206 
0.0715 
0.1700 
0.3127 
0.4783 
0.6384 
0.7710 
0.8672 
0.9292 
0.9651 
0.9841 
0.9932 
0.9973 
0.9990 
0.9996 
0.9999 
1.0000 


6.8 


0.0011 
0.0087 
0.0344 
0.0928 
0.1920 
0.3270 
0.4799 
0.6285 
0.7548 
0.8502 
0.9151 
0.9552 


4.9 


0.9717 
0.9880 
0.9953 
0.9983 
0.9994 
0.9998 
0.9999 


5.9 


0.0027 
0.0189 
0.0666 
0.1604 
0.2987 
0.4619 
0.6224 
0.7576 
0.8574 
0.9228 
0.9614 
0.9821 
0.9922 
0.9969 
0.9988 
0.9996 
0.9999 
1.0000 


6.9 


0.0010 
0.0080 
0.0320 
0.0871 
0.1823 
0.3137 
0.4647 
0.6136 
0.7420 
0.8405 
0.9084 
0.9510 


5.0 


0.9682 
0.9863 
0.9945 
0.9980 
0.9993 
0.9998 
0.9999 


6.0 


0.0025 
0.0174 
0.0620 
0.1512 
0.2851 
0.4457 
0.6063 
0.7440 
0.8472 
0.9161 
0.9574 
0.9799 
0.9912 
0.9964 
0.9986 
0.9995 
0.9998 
0.9999 


7.0 


0.0009 
0.0073 
0.0296 
0.0818 
0.1730 
0.3007 
0.4497 
0.5987 
0.7291 
0.8305 
0.9015 
0.9467 




Table 3: (continued) 


x λ=6.1


12 0.9900 
13 0.9958 
14 0.9984 
15 0.9994 
16 0.9998 
17 0.9999 
18 1.0000 
19 1.0000 


x λ=7.1


0 0.0008
1 0.0067
2 0.0275
3 0.0767
4 0.1641
5 0.2881
6 0.4349
7 0.5838
8 0.7160
9 0.8202
10 0.8942
11 0.9420
12 0.9703
13 0.9857
14 0.9935
15 0.9972
16 0.9989
17 0.9996
18 0.9998
19 0.9999
20 1.0000
21 1.0000


x λ=8.1


0 0.0003
1 0.0028
2 0.0127
3 0.0396
4 0.0940
5 0.1822
6 0.3013
7 0.4391


6.2 


0.9887 
0.9952 
0.9981 
0.9993 
0.9997 
0.9999 
1.0000 
1.0000 


7.2 


0.0007 
0.0061 
0.0255 
0.0719 
0.1555 
0.2759 
0.4204 
0.5689 
0.7027 
0.8096 
0.8867 
0.9371 
0.9673 
0.9841 
0.9927 
0.9969 
0.9987 
0.9995 
0.9998 
0.9999 
1.0000 
1.0000 


8.2 


0.0003 
0.0025 
0.0118 
0.0370 
0.0887 
0.1736 
0.2896 
0.4254 


6.3 


0.9873 
0.9945 
0.9978 
0.9992 
0.9997 
0.9999 
1.0000 
1.0000 


7.3 


0.0007 
0.0056 
0.0236 
0.0674 
0.1473 
0.2640 
0.4060 
0.5541 
0.6892 
0.7988 
0.8788 
0.9319 
0.9642 
0.9824 
0.9918 
0.9964 
0.9985 
0.9994 
0.9998 
0.9999 
1.0000 
1.0000 


8.3 


0.0002 
0.0023 
0.0109 
0.0346 
0.0837 
0.1653 
0.2781 
0.4119 


6.4 


0.9857 
0.9937 
0.9974 
0.9990 
0.9996 
0.9999 
1.0000 
1.0000 


7.4 


0.0006 
0.0051 
0.0219 
0.0632 
0.1395 
0.2526 
0.3920 
0.5393 
0.6757 
0.7877 
0.8707 
0.9265 
0.9609 
0.9805 
0.9908 
0.9959 
0.9983 
0.9993 
0.9997 
0.9999 
1.0000 
1.0000 


8.4 


0.0002 
0.0021 
0.0100 
0.0323 
0.0789 
0.1573 
0.2670 
0.3987 


6.5 


0.9840 
0.9929 
0.9970 
0.9988 
0.9996 
0.9998 
0.9999 
1.0000 


7.5 


0.0006 
0.0047 
0.0203 
0.0591 
0.1321 
0.2414 
0.3782 
0.5246 
0.6620 
0.7764 
0.8622 
0.9208 
0.9573 
0.9784 
0.9897 
0.9954 
0.9980 
0.9992 
0.9997 
0.9999 
1.0000 
1.0000 


8.5 


0.0002 
0.0019 
0.0093 
0.0301 
0.0744 
0.1496 
0.2562 
0.3856 


6.6 


0.9821 
0.9920 
0.9966 
0.9986 
0.9995 
0.9998 
0.9999 
1.0000 


7.6 


0.0005 
0.0043 
0.0188 
0.0554 
0.1249 
0.2307 
0.3646 
0.5100 
0.6482 
0.7649 
0.8535 
0.9148 
0.9536 
0.9762 
0.9886 
0.9948 
0.9978 
0.9991 
0.9996 
0.9999 
1.0000 
1.0000 


8.6 


0.0002 
0.0018 
0.0086 
0.0281 
0.0701 
0.1422 
0.2457 
0.3728 


6.7 


0.9801 
0.9909 
0.9961 
0.9984 
0.9994 
0.9998 
0.9999 
1.0000 


7.7 


0.0005 
0.0039 
0.0174 
0.0518 
0.1181 
0.2203 
0.3514 
0.4956
0.6343 
0.7531 
0.8445 
0.9085 
0.9496 
0.9739 
0.9873 
0.9941 
0.9974 
0.9989 
0.9996 
0.9998 
0.9999 
1.0000 


8.7 


0.0002 
0.0016 
0.0079 
0.0262 
0.0660 
0.1352 
0.2355 
0.3602 


6.8 


0.9779 
0.9898 
0.9956 
0.9982 
0.9993 
0.9997 
0.9999 
1.0000 


7.8 


0.0004 
0.0036 
0.0161 
0.0485 
0.1117 
0.2103 
0.3384 
0.4812 
0.6204 
0.7411 
0.8352 
0.9020 
0.9454 
0.9714 
0.9859 
0.9934 
0.9971 
0.9988 
0.9995 
0.9998 
0.9999 
1.0000 


8.8 


0.0002 
0.0015 
0.0073 
0.0244 
0.0621 
0.1284 
0.2256 
0.3478 


6.9 


0.9755 
0.9885 
0.9950 
0.9979 
0.9992 
0.9997 
0.9999 
1.0000 


7.9 


0.0004 
0.0033 
0.0149 
0.0453 
0.1055 
0.2006 
0.3257 
0.4670 
0.6065 
0.7290 
0.8257 
0.8952 
0.9409
0.9687
0.9844 
0.9926 
0.9967 
0.9986 
0.9994 
0.9998 
0.9999 
1.0000 


8.9 


0.0001 
0.0014 
0.0068 
0.0228 
0.0584 
0.1219 
0.2160 
0.3357 


7.0 


0.9730 
0.9872 
0.9943 
0.9976 
0.9990 
0.9996 
0.9999 
1.0000 


8.0 


0.0003 
0.0030 
0.0138 
0.0424 
0.0996 
0.1912 
0.3134 
0.4530 
0.5925 
0.7166 
0.8159 
0.8881 
0.9362 
0.9658 
0.9827 
0.9918 
0.9963 
0.9984 
0.9993 
0.9997 
0.9999 
1.0000 


9.0 


0.0001 
0.0012 
0.0062 
0.0212 
0.0550 
0.1157 
0.2068 
0.3239 


Table 3: (continued) 


x λ=8.1


8 0.5786 

9 0.7041 
10 0.8058 
11 0.8807 
12 0.9313 
13 0.9628 
14 0.9810 
15 0.9908 
16 0.9958 
17 0.9982 
18 0.9992 
19 0.9997 
20 0.9999 
21 1.0000 
22 1.0000 


x λ=9.1


0 0.0001
1 0.0011
2 0.0058
3 0.0198
4 0.0517
5 0.1098
6 0.1978
7 0.3123
8 0.4426
9 0.5742
10 0.6941
11 0.7932
12 0.8684
13 0.9210
14 0.9552
15 0.9760
16 0.9878
17 0.9941
18 0.9973
19 0.9988
20 0.9995
21 0.9998
22 0.9999
23 1.0000
24 1.0000


8.2 


0.5647 
0.6915 
0.7955 
0.8731 
0.9261 
0.9595 
0.9791 
0.9898 
0.9953 
0.9979 
0.9991 
0.9997 
0.9999 
1.0000 
1.0000 


9.2 


0.0001 
0.0010 
0.0053 
0.0184 
0.0486 
0.1041 
0.1892 
0.3010 
0.4296 
0.5611 
0.6820 
0.7832
0.8607 
0.9156 
0.9517 
0.9738 
0.9865 
0.9934 
0.9969 
0.9986 
0.9994 
0.9998 
0.9999 
1.0000 
1.0000 


8.3 


0.5507 
0.6788 
0.7850 
0.8652 
0.9207 
0.9561 
0.9771 
0.9887 
0.9947 
0.9977 
0.9990 
0.9996 
0.9998 
0.9999 
1.0000 


9.3 


0.0001 
0.0009 
0.0049 
0.0172 
0.0456 
0.0986 
0.1808 
0.2900 
0.4168 
0.5479 
0.6699 
0.7730 
0.8529 
0.9100 
0.9480 
0.9715 
0.9852 
0.9927 
0.9966 
0.9985 
0.9993 
0.9997 
0.9999 
1.0000 
1.0000 


8.4 


0.5369 
0.6659 
0.7743 
0.8571 
0.9150 
0.9524 
0.9749 
0.9875 
0.9941 
0.9973 
0.9989 
0.9995 
0.9998 
0.9999 
1.0000 


9.4 


0.0001 
0.0009 
0.0045 
0.0160 
0.0429 
0.0935 
0.1727 
0.2792 
0.4042 
0.5349 
0.6576 
0.7626 
0.8448 
0.9042 
0.9441 
0.9691 
0.9838 
0.9919 
0.9962 
0.9983 
0.9992 
0.9997 
0.9999 
1.0000 
1.0000 


8.5 


0.5231 
0.6530 
0.7634 
0.8487 
0.9091 
0.9486 
0.9726 
0.9862 
0.9934 
0.9970 
0.9987 
0.9995 
0.9998 
0.9999 
1.0000 


9.5 


0.0001 
0.0008 
0.0042 
0.0149 
0.0403 
0.0885 
0.1649 
0.2687 
0.3918 
0.5218 
0.6453 
0.7520 
0.8364 
0.8981 
0.9400 
0.9665 
0.9823 
0.9911 
0.9957 
0.9980 
0.9991 
0.9996 
0.9999 
0.9999 
1.0000 




8.6 


0.5094 
0.6400 
0.7522 
0.8400 
0.9029 
0.9445 
0.9701 
0.9848 
0.9926 
0.9966 
0.9985 
0.9994 
0.9998 
0.9999 
1.0000 


9.6 


0.0001 
0.0007 
0.0038 
0.0138 
0.0378 
0.0838 
0.1574 
0.2584 
0.3798 
0.5089 
0.6329 
0.7412 
0.8279 
0.8919 
0.9357 
0.9638 
0.9806 
0.9902 
0.9952 
0.9978 
0.9990 
0.9996 
0.9998 
0.9999 
1.0000 


8.7 


0.4958 
0.6269 
0.7409 
0.8311 
0.8965 
0.9403 
0.9675 
0.9832 
0.9918 
0.9962 
0.9983 
0.9993 
0.9997 
0.9999 
1.0000 


9.7 


0.0001
0.0007 
0.0035 
0.0129 
0.0355 
0.0793 
0.1502 
0.2485 
0.3676 
0.4960 
0.6205 
0.7303 
0.8191 
0.8853 
0.9312 
0.9609 
0.9789 
0.9892 
0.9947 
0.9975 
0.9989 
0.9995 
0.9998 
0.9999 
1.0000 


8.8 


0.4823 
0.6137 
0.7294 
0.8220 
0.8898 
0.9358 
0.9647 
0.9816 
0.9909 
0.9957 
0.9981 
0.9992 
0.9997 
0.9999 
1.0000 


9.8 


0.0001 
0.0006 
0.0033 
0.0120 
0.0333 
0.0750 
0.1433 
0.2388 
0.3558 
0.4832 
0.6080 
0.7193 
0.8101 
0.8786 
0.9265 
0.9579 
0.9770 
0.9881 
0.9941 
0.9972 
0.9987 
0.9995 
0.9998 
0.9999 
1.0000 


8.9 


0.4689 
0.6006 
0.7178 
0.8126 
0.8829 
0.9311 
0.9617 
0.9798 
0.9899 
0.9952 
0.9978 
0.9991 
0.9996 
0.9998 
0.9999 


9.9 


0.0001 
0.0005 
0.0030 
0.0111 
0.0312 
0.0710 
0.1366 
0.2294 
0.3442 
0.4705 
0.5955 
0.7081 
0.8009 
0.8716 
0.9216 
0.9546 
0.9751 
0.9870 
0.9935 
0.9969 
0.9986 
0.9994 
0.9997 
0.9999 
1.0000 


9.0 


0.4557 
0.5874 
0.7060 
0.8030 
0.8758 
0.9261 
0.9585 
0.9780 
0.9889 
0.9947 
0.9976 
0.9989 
0.9996 
0.9998 
0.9999 


10.0 


0.0000 
0.0005 
0.0028 
0.0103 
0.0293 
0.0671 
0.1301 
0.2202 
0.3328 
0.4579 
0.5830 
0.6968 
0.7916 
0.8645
0.9165 
0.9513 
0.9730 
0.9857 
0.9928 
0.9965 
0.9984 
0.9994 
0.9997 
0.9999 
1.0000 




Table 3: (continued) 


x: 0 to 39, ascending within each column


λ=11


0.0000 
0.0002 
0.0012 
0.0049 
0.0151 
0.0375 
0.0786 
0.1432 
0.2320 
0.3405 
0.4599 
0.5793 
0.6887 
0.7813 
0.8540 
0.9074 
0.9441 
0.9678 
0.9823 
0.9907 
0.9953 
0.9977 
0.9990 
0.9995 
0.9998 
0.9999 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


12 


0.0000 
0.0001 
0.0005 
0.0023 
0.0076 
0.0203 
0.0458 
0.0895 
0.1550 
0.2424 
0.3472 
0.4616 
0.5760 
0.6815 
0.7720 
0.8444 
0.8987 
0.9370 
0.9626 
0.9787 
0.9884 
0.9939 
0.9970 
0.9985 
0.9993 
0.9997 
0.9999 
0.9999 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


13 


0.0000 
0.0000 
0.0002 
0.0011 
0.0037 
0.0107 
0.0259 
0.0540 
0.0998 
0.1658 
0.2517 
0.3532 
0.4631 
0.5730 
0.6751 
0.7636 
0.8355 
0.8905 
0.9302 
0.9573 
0.9750 
0.9859 
0.9924 
0.9960 
0.9980 
0.9990 
0.9995 
0.9998 
0.9999 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


14 


0.0000 
0.0000 
0.0001 
0.0005 
0.0018 
0.0055 
0.0142 
0.0316 
0.0621 
0.1094 
0.1757 
0.2600 
0.3585 
0.4644 
0.5704 
0.6694 
0.7559 
0.8272 
0.8826 
0.9325 
0.9521 
0.9712 
0.9833 
0.9907 
0.9950 
0.9974 
0.9987 
0.9994 
0.9997 
0.9999 
0.9999 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


15 


0.0000 
0.0000 
0.0000 
0.0002 
0.0009 
0.0028 
0.0076 
0.0180 
0.0374 
0.0699 
0.1185 
0.1848 
0.2676 
0.3032 
0.4657 
0.5681 
0.6641 
0.7489 
0.8195 
0.8752 
0.9170 
0.9469 
0.9673 
0.9805 
0.9888 
0.9938 
0.9967 
0.9983 
0.9991 
0.9996 
0.9998 
0.9999 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


16 


0.0000 
0.0000 
0.0000 
0.0001 
0.0004 
0.0014 
0.0040 
0.0100 
0.0220 
0.0433 
0.0774 
0.1270 
0.1931 
0.2745 
0.3675 
0.4667 
0.5660 
0.6593 
0.7423 
0.8122 
0.8682 
0.9108 
0.9418 
0.9633 
0.9777 
0.9869 
0.9925 
0.9959 
0.9978 
0.9989 
0.9994 
0.9997 
0.9999 
0.9999 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


17 


0.0000 
0.0000 
0.0000 
0.0000 
0.0002 
0.0007 
0.0021 
0.0054 
0.0126 
0.0261 
0.0491 
0.0847 
0.1350 
0.2009 
0.2808 
0.3715 
0.4677 
0.5640 
0.6550 
0.7363 
0.8055 
0.8615 
0.9047 
0.9367 
0.9594 
0.9748 
0.9848 
0.9912 
0.9950 
0.9973 
0.9986 
0.9993 
0.9996 
0.9998 
0.9999 
1.0000 
1.0000 
1.0000 
1.0000 
1.0000 


18 


0.0000 
0.0000 
0.0000 
0.0000 
0.0001 
0.0003 
0.0010 
0.0029 
0.0071 
0.0154 
0.0304 
0.0549 
0.0917 
0.1426 
0.2081 
0.2867 
0.3751 
0.4686 
0.5622 
0.6509 
0.7307 
0.7991 
0.8551 
0.8989 
0.9317 
0.9554 
0.9718 
0.9827 
0.9897 
0.9941 
0.9967 
0.9982 
0.9990 
0.9995 
0.9998 
0.9999 
0.9999 
1.0000 
1.0000 
1.0000 


19 


0.0000 
0.0000 
0.0000 
0.0000 
0.0000 
0.0002 
0.0005 
0.0015 
0.0039 
0.0089 
0.0183 
0.0347 
0.0606 
0.0984 
0.1497 
0.2148 
0.2920 
0.3784 
0.4695 
0.5606 
0.6472 
0.7255 
0.7931 
0.8490 
0.8933 
0.9269 
0.9514 
0.9687 
0.9805 
0.9881 
0.9930 
0.9960 
0.9978 
0.9988 
0.9994 
0.9997 
0.9998 
0.9999 
1.0000 
1.0000 


20 


0.0000 
0.0000 
0.0000 
0.0000 
0.0000 
0.0001 
0.0003 
0.0008 
0.0021 
0.0050 
0.0108 
0.0214 
0.0390 
0.0661 
0.1049 
0.1565 
0.2211 
0.2970 
0.3814 
0.4703 
0.5591 
0.6437 
0.7206 
0.7875 
0.8432 
0.8878 
0.9221 
0.9475 
0.9657 
0.9782 
0.9865 
0.9919 
0.9953 
0.9973 
0.9985 
0.9992 
0.9996 
0.9998 
0.9999 
1.0000 
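
Entries of this kind can be cross-checked numerically. The following is a minimal Python sketch, assuming that Table 3 tabulates the cumulative Poisson probabilities $\sum_{r=0}^{x}\lambda^{r}e^{-\lambda}/r!$ (the tabulated values are consistent with this reading). The function name `poisson_cdf` is ours and purely illustrative.

```python
import math


def poisson_cdf(x: int, lam: float) -> float:
    """Return Pr{X <= x} for a Poisson variable with parameter lam.

    Assumption: this is the quantity tabulated in Table 3.
    """
    term = math.exp(-lam)      # Pr{X = 0}
    total = term
    for r in range(1, x + 1):
        term *= lam / r        # Pr{X = r} obtained from Pr{X = r - 1}
        total += term
    return total


if __name__ == "__main__":
    # Print a few entries to four decimals for comparison with the table.
    for lam in (11, 15, 20):
        row = [round(poisson_cdf(x, lam), 4) for x in (5, 10, 15, 20, 25)]
        print(f"lambda = {lam}: {row}")
```

Building each term from the previous one avoids computing large factorials directly, which keeps the sum stable for the values of $\lambda$ covered by the table.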




Table 4: Normal probabilities 


Entry: Probability $= \int_{0}^{x} \frac{1}{\sqrt{2\pi}}\, e^{-t^{2}/2}\,dt$


x 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 


0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359 
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753 
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141 
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517 
0.4 0.1554 0.1591 0.1627 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879 


0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224 
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2518 0.2549 
0.7 0.2580 0.2612 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852 
0.8 0.2882 0.2910 0.2939 0.2967 0.2996 0.3023 0.3051 0.3079 0.3106 0.3133 
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3290 0.3315 0.3340 0.3365 0.3389 


1.0 0.3414 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621 
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830 
1.2 0.3849 0.3869 0.3888 0.3906 0.3925 0.3943 0.3962 0.3980 0.3997 0.4015 
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4146 0.4162 0.4177 
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4278 0.4292 0.4306 0.4319 


1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441 
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545 
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633 
1.8 0.4641 0.4648 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706 
1.9 0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4762 0.4767 


2.0 0.4773 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817 
2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857 
2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890 
2.3 0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4914 0.4916 
2.4 0.4918 0.4920 0.4922 0.4925 0.4927 0.4929 0.4931 0.4933 0.4934 0.4936 


2.5 0.4938 0.4940 0.4941 0.4943 0.4945 0.4946 0.4948 0.4949 0.4951 0.4952 
2.6 0.4953 0.4955 0.4956 0.4957 0.4959 0.4960 0.4961 0.4962 0.4963 0.4964 
2.7 0.4965 0.4966 0.4967 0.4968 0.4969 0.4970 0.4971 0.4972 0.4973 0.4974 
2.8 0.4974 0.4975 0.4976 0.4977 0.4977 0.4978 0.4979 0.4979 0.4980 0.4981 
2.9 0.4981 0.4982 0.4982 0.4983 0.4984 0.4984 0.4985 0.4985 0.4986 0.4986 
3.0 0.4986 0.4987 0.4987 0.4988 0.4988 0.4988 0.4989 0.4989 0.4990 0.4990 
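
The entries of Table 4 can be reproduced with the error function, since the area under the standard normal density between 0 and $x$ equals $\frac{1}{2}\,\mathrm{erf}(x/\sqrt{2})$. The short Python sketch below uses only the standard library; the function name `normal_area` is an illustrative choice of ours, not notation from the text.

```python
import math


def normal_area(x: float) -> float:
    """Area under the standard normal density between 0 and x (x >= 0)."""
    return 0.5 * math.erf(x / math.sqrt(2.0))


if __name__ == "__main__":
    # Row label plus column label gives x, for example 1.9 + 0.06 = 1.96.
    for x in (0.50, 1.00, 1.96, 2.58):
        print(f"x = {x:4.2f}   area = {normal_area(x):.4f}")
```

For a right-tail probability, subtract the tabulated area from 0.5; for a central interval $(-x, x)$, double it.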




Table 5: Student-t table, right tail 


Entry: $t_{\nu,\alpha}$, where $\int_{t_{\nu,\alpha}}^{\infty} f(t_{\nu})\,dt_{\nu} = \alpha$ 
and $f(t_{\nu})$ is the density of a Student-t with $\nu$ degrees of freedom 
ν α=0.10 α=0.05 α=0.025 α=0.01 α=0.005 


1 3.078 6.314 12.706 31.821 63.657 
2 1.886 2.920 4.303 6.965 9.925 
3 1.638 2.353 3.182 4.541 5.841 
4 1.533 2.132 2.776 3.747 4.604 
5 1.476 2.015 2.571 3.365 4.032 
6 1.440 1.943 2.447 3.143 3.707 
7 1.415 1.895 2.365 2.998 3.499 
8 1.397 1.860 2.306 2.896 3.355 
9 1.383 1.833 2.262 2.821 3.250 
10 1.372 1.812 2.228 2.764 3.169 
11 1.363 1.796 2.201 2.718 3.106 
12 1.356 1.782 2.179 2.681 3.055 
13 1.350 1.771 2.160 2.650 3.012 
14 1.345 1.761 2.145 2.624 2.977 
15 1.341 1.753 2.131 2.602 2.947 
16 1.337 1.746 2.120 2.583 2.921 
17 1.333 1.740 2.110 2.567 2.898 
18 1.330 1.734 2.101 2.552 2.878 
19 1.328 1.729 2.093 2.539 2.861 
20 1.325 1.725 2.086 2.528 2.845 
21 1.323 1.721 2.080 2.518 2.831 
22 1.321 1.717 2.074 2.508 2.819 
23 1.319 1.714 2.069 2.500 2.807 
24 1.318 1.711 2.064 2.492 2.797 
25 1.316 1.708 2.060 2.485 2.787 
26 1.315 1.706 2.056 2.479 2.779 
27 1.314 1.703 2.052 2.473 2.771 
28 1.313 1.701 2.048 2.467 2.763 
29 1.311 1.699 2.045 2.462 2.756 


∞ 1.282 1.645 1.960 2.326 2.576 
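
A percentage point $t_{\nu,\alpha}$ can also be approximated numerically when the required $\nu$ or $\alpha$ is not tabulated. The sketch below is one possible approach, not the method used to construct Table 5: it integrates the Student-t density by Simpson's rule and locates the point by bisection. The function names, the number of Simpson intervals and the search bracket $[0, 100]$ are assumptions made for illustration; the bracket is wide enough for the range of $\alpha$ in the table but not for much smaller $\alpha$ when $\nu = 1$.

```python
import math


def t_density(t: float, nu: float) -> float:
    """Student-t density with nu degrees of freedom."""
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi))
    return math.exp(log_c - (nu + 1) / 2 * math.log1p(t * t / nu))


def t_right_tail(t: float, nu: float, n: int = 4000) -> float:
    """Pr{T >= t} for t >= 0: one half minus the Simpson integral over [0, t]."""
    if t <= 0:
        return 0.5
    h = t / n                                  # n must be even for Simpson's rule
    total = t_density(0.0, nu) + t_density(t, nu)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * t_density(i * h, nu)
    return 0.5 - total * h / 3


def t_point(nu: float, alpha: float) -> float:
    """Right-tail percentage point: the t with Pr{T >= t} = alpha."""
    lo, hi = 0.0, 100.0                        # assumed search bracket
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_right_tail(mid, nu) > alpha:      # tail still too large: move right
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    print(round(t_point(10, 0.05), 3))
```

Bisection works here because the right-tail probability decreases steadily as $t$ increases, so there is exactly one crossing of the level $\alpha$ inside the bracket.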


Table 6: The chi-square table, right tail 


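Entry: $\chi^{2}_{\nu,\alpha}$, where $\int_{\chi^{2}_{\nu,\alpha}}^{\infty} f(\chi^{2}_{\nu})\,d\chi^{2}_{\nu} = \alpha$ 
with $f(\chi^{2}_{\nu})$ being the density of a chi-square with $\nu$ degrees of freedom 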




  ν  α=0.995    0.99   0.975    0.95    0.10    0.05   0.025    0.01   0.005   0.001
  1   0.0000  0.0002  0.0010  0.0039    2.71    3.84    5.02    6.63    7.88   10.83
  2   0.0100  0.0201  0.0506  0.1030    4.61    5.99    7.38    9.21   10.60   13.81
  3   0.0717  0.1148  0.2160  0.3520    6.25    7.81    9.35   11.34   12.84   16.27
  4   0.2070  0.2970  0.4844  0.7110    7.78    9.49   11.14   13.28   14.86   18.47
  5    0.412  0.5543   0.831    1.15    9.24   11.07   12.83   15.09   16.75   20.52
  6    0.676   0.872    1.24    1.64   10.64   12.59   14.45   16.81   18.55   22.46
  7    0.989    1.24    1.69    2.17   12.02   14.07   16.01   18.48   20.28   24.32
  8     1.34    1.65    2.18    2.73   13.36   15.51   17.53   20.09   21.95   26.12
  9     1.73    2.09    2.70    3.33   14.68   16.92   19.02   21.67   23.59   27.88
 10     2.16    2.56    3.25    3.94   15.99   18.31   20.48   23.21   25.19   29.59
 11     2.60    3.05    3.82    4.57   17.28   19.68   21.92   24.73   26.76   31.26
 12     3.07    3.57    4.40    5.23   18.55   21.03   23.34   26.22   28.30   32.91
 13     3.57    4.11    5.01    5.89   19.81   22.36   24.74   27.69   29.82   34.53
 14     4.07    4.66    5.63    6.57   21.06   23.68   26.12   29.14   31.32   36.12
 15     4.60    5.23    6.26    7.26   22.31   25.00   27.49   30.58   32.80   37.70
 16     5.14    5.81    6.91    7.96   23.54   26.30   28.85   32.00   34.27   39.25
 17     5.70    6.41    7.56    8.67   24.77   27.59   30.19   33.41   35.72   40.79
 18     6.26    7.01    8.23    9.39   25.99   28.87   31.53   34.81   37.16   42.31
 19     6.84    7.63    8.91   10.12   27.20   30.14   32.85   36.19   38.58   43.82
 20     7.43    8.26    9.59   10.85   28.41   31.41   34.17   37.57   40.00   45.31
 21     8.03    8.90   10.28   11.59   29.62   32.67   35.48   38.93   41.40   46.80
 22     8.64    9.54   10.98   12.34   30.81   33.92   36.78   40.29   42.80   48.27
 23     9.26   10.20   11.69   13.09   32.01   35.17   38.08   41.64   44.18   49.73
 24     9.89   10.86   12.40   13.85   33.20   36.42   39.36   42.98   45.56   51.18
 25    10.52   11.52   13.12   14.61   34.38   37.65   40.65   44.31   46.93   52.62
 26    11.16   12.20   13.84   15.38   35.56   38.89   41.92   45.64   48.29   54.05
 27    11.81   12.88   14.57   16.15   36.74   40.11   43.19   46.96   49.64   55.48
 28    12.46   13.56   15.31   16.93   37.92   41.34   44.46   48.28   50.99   56.89
 29    13.12   14.26   16.05   17.71   39.09   42.56   45.72   49.59   52.34   58.30
 30    13.79   14.95   16.79   18.49   40.26   43.77   46.98   50.89   53.67   59.70
 40    20.71   22.16   24.43   26.51   51.81   55.76   59.34   63.69   66.77   73.40
 50    27.99   29.71   32.36   34.76   63.17   67.50   71.42   76.15   79.49   86.66
 60    35.53   37.48   40.48   43.19   74.40   79.08   83.30   88.38   91.95   99.61
 70    43.28   45.44   48.76   51.74   85.53   90.53   95.02   100.4   104.2   112.3
 80    51.17   53.54   57.15   60.39   96.58   101.9   106.6   112.3   116.3   124.8
 90    59.20   61.75   65.65   69.13   107.6   113.1   118.1   124.1   128.3   137.2
100    67.33   70.06   74.22   77.93   118.5   124.3   129.6   135.8   140.2   149.4
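
Chi-square percentage points for untabulated $\nu$ or $\alpha$ can be approximated with a few lines of code. The Python sketch below is an illustration under our own assumptions, not the procedure behind Table 6: the right-tail probability is obtained from the standard power series for the incomplete gamma function, and the point is then found by bisection. The function names and the upper end of the search bracket are hypothetical choices.

```python
import math


def chi2_right_tail(x: float, nu: float) -> float:
    """Pr{chi-square with nu degrees of freedom >= x}."""
    if x <= 0:
        return 1.0
    a, z = nu / 2.0, x / 2.0
    # Series for the lower tail: exp(-z) * z**a * sum_k z**k / Gamma(a + k + 1)
    term = math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
    total = term
    k = 1
    while term > 1e-16 * total:
        term *= z / (a + k)
        total += term
        k += 1
    return 1.0 - total


def chi2_point(nu: float, alpha: float) -> float:
    """Right-tail percentage point: the x with Pr{chi-square >= x} = alpha."""
    lo = 0.0
    hi = nu + 20.0 * math.sqrt(2.0 * nu) + 50.0   # assumed generous upper bound
    for _ in range(80):
        mid = (lo + hi) / 2
        if chi2_right_tail(mid, nu) > alpha:      # tail still too large: move right
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    print(round(chi2_point(10, 0.05), 2))
```

The upper bound uses the mean $\nu$ and standard deviation $\sqrt{2\nu}$ of the chi-square variable, so that the bracket comfortably contains every point in the range of $\alpha$ shown in the table.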




Table 7: F-distribution, right tail, 5% points 


Entry: $F_{\nu_1,\nu_2,0.05}$, where $\int_{F_{\nu_1,\nu_2,0.05}}^{\infty} f(F_{\nu_1,\nu_2})\,dF_{\nu_1,\nu_2} = 0.05$ 
with $f(F_{\nu_1,\nu_2})$ being the density of an F-variable with $\nu_1$ and $\nu_2$ degrees of freedom 

 ν2  ν1=1      2      3      4      5      6      7      8     10     12     24      ∞
  1  161.4  199.5  215.7  224.6  230.2  234.0  236.8  238.9  241.9  243.9  249.0  254.3
  2   18.5   19.0   19.2   19.2   19.3   19.3   19.4   19.4   19.4   19.4   19.5   19.5
  3  10.13   9.55   9.28   9.12   9.01   8.94   8.89   8.85   8.79   8.74   8.64   8.53
  4   7.71   6.94   6.59   6.39   6.26   6.16   6.09   6.04   5.96   5.91   5.77   5.63
  5   6.61   5.79   5.41   5.19   5.05   4.95   4.88   4.82   4.74   4.68   4.53   4.36
  6   5.99   5.14   4.76   4.53   4.39   4.28   4.21   4.15   4.06   4.00   3.84   3.67
  7   5.59   4.74   4.35   4.12   3.97   3.87   3.79   3.73   3.64   3.57   3.41   3.23
  8   5.32   4.46   4.07   3.84   3.69   3.58   3.50   3.44   3.35   3.28   3.12   2.93
  9   5.12   4.26   3.86   3.63   3.48   3.37   3.29   3.23   3.14   3.07   2.90   2.71
 10   4.96   4.10   3.71   3.48   3.33   3.22   3.14   3.07   2.98   2.91   2.74   2.54
 11   4.84   3.98   3.59   3.36   3.20   3.09   3.01   2.95   2.85   2.79   2.61   2.40
 12   4.75   3.89   3.49   3.26   3.11   3.00   2.91   2.85   2.75   2.69   2.51   2.30
 13   4.67   3.81   3.41   3.18   3.03   2.92   2.83   2.77   2.67   2.60   2.42   2.21
 14   4.60   3.74   3.34   3.11   2.96   2.85   2.76   2.70   2.60   2.53   2.35   2.13
 15   4.54   3.68   3.29   3.06   2.90   2.79   2.71   2.64   2.54   2.48   2.29   2.07
 16   4.49   3.63   3.24   3.01   2.85   2.74   2.66   2.59   2.49   2.42   2.24   2.01
 17   4.45   3.59   3.20   2.96   2.81   2.70   2.61   2.55   2.45   2.38   2.19   1.96
 18   4.41   3.55   3.16   2.93   2.77   2.66   2.58   2.51   2.41   2.34   2.15   1.92
 19   4.38   3.52   3.13   2.90   2.74   2.63   2.54   2.48   2.38   2.31   2.11   1.88
 20   4.35   3.49   3.10   2.87   2.71   2.60   2.51   2.45   2.35   2.28   2.08   1.84
 21   4.32   3.47   3.07   2.84   2.68   2.57   2.49   2.42   2.32   2.25   2.05   1.81
 22   4.30   3.44   3.05   2.82   2.66   2.55   2.46   2.40   2.30   2.23   2.03   1.78
 23   4.28   3.42   3.03   2.80   2.64   2.53   2.44   2.37   2.27   2.20   2.00   1.76
 24   4.26   3.40   3.01   2.78   2.62   2.51   2.42   2.36   2.25   2.18   1.98   1.73
 25   4.24   3.39   2.99   2.76   2.60   2.49   2.40   2.34   2.24   2.16   1.96   1.71
 26   4.23   3.37   2.98   2.74   2.59   2.47   2.39   2.32   2.22   2.15   1.95   1.69
 27   4.21   3.35   2.96   2.73   2.57   2.46   2.37   2.31   2.20   2.13   1.93   1.67
 28   4.20   3.34   2.95   2.71   2.56   2.45   2.36   2.29   2.19   2.12   1.91   1.65
 29   4.18   3.33   2.93   2.70   2.55   2.43   2.35   2.28   2.18   2.10   1.90   1.64
 30   4.17   3.32   2.92   2.69   2.53   2.42   2.33   2.27   2.16   2.09   1.89   1.62
 32   4.15   3.29   2.90   2.67   2.51   2.40   2.31   2.24   2.14   2.07   1.86   1.59
 34   4.13   3.28   2.88   2.65   2.49   2.38   2.29   2.23   2.12   2.05   1.84   1.57
 36   4.11   3.26   2.87   2.63   2.48   2.36   2.28   2.21   2.11   2.03   1.82   1.55
 38   4.10   3.24   2.85   2.62   2.46   2.35   2.26   2.19   2.09   2.02   1.81   1.53
 40   4.08   3.23   2.84   2.61   2.45   2.34   2.25   2.18   2.08   2.00   1.79   1.51
 60   4.00   3.15   2.76   2.53   2.37   2.25   2.17   2.10   1.99   1.92   1.70   1.39
120   3.92   3.07   2.68   2.45   2.29   2.18   2.09   2.02   1.91   1.83   1.61   1.25
  ∞   3.84   3.00   2.60   2.37   2.21   2.10   2.01   1.94   1.83   1.75   1.52   1.00




Table 8: F-distribution, right tail, 1% points 


Entry: $F_{\nu_1,\nu_2,0.01}$, where $\int_{F_{\nu_1,\nu_2,0.01}}^{\infty} f(F_{\nu_1,\nu_2})\,dF_{\nu_1,\nu_2} = 0.01$ 
with $f(F_{\nu_1,\nu_2})$ being the density of an F-variable with $\nu_1$ and $\nu_2$ degrees of freedom 

 ν2  ν1=1       2      3      4      5      6      7      8     10     12     24      ∞
  1   4052  4999.5   5403   5625   5764   5859   5928   5981   6056   6106   6235   6366
  2   98.5    99.0   99.2   99.2   99.3   99.3   99.4   99.4   99.4   99.4   99.5   99.5
  3   34.1    30.8   29.5   28.7   28.2   27.9   27.7   27.5   27.2   27.1   26.6   26.1
  4   21.2    18.0   16.7   16.0   15.5   15.2   15.0   14.8   14.5   14.4   13.9   13.5
  5  16.26   13.27  12.06  11.39  10.97  10.67  10.46  10.29  10.05   9.89   9.47   9.02
  6  13.74   10.92   9.78   9.15   8.75   8.47   8.26   8.10   7.87   7.72   7.31   6.88
  7  12.25    9.55   8.45   7.85   7.46   7.19   6.99   6.84   6.62   6.47   6.07   5.65
  8  11.26    8.65   7.59   7.01   6.63   6.37   6.18   6.03   5.81   5.67   5.28   4.86
  9  10.56    8.02   6.99   6.42   6.06   5.80   5.61   5.47   5.26   5.11   4.73   4.31
 10  10.04    7.56   6.55   5.99   5.64   5.39   5.20   5.06   4.85   4.71   4.33   3.91
 11   9.65    7.21   6.22   5.67   5.32   5.07   4.89   4.74   4.54   4.40   4.02   3.60
 12   9.33    6.93   5.95   5.41   5.06   4.82   4.64   4.50   4.30   4.16   3.78   3.36
 13   9.07    6.70   5.74   5.21   4.86   4.62   4.44   4.30   4.10   3.96   3.59   3.17
 14   8.86    6.51   5.56   5.04   4.70   4.46   4.28   4.14   3.94   3.80   3.43   3.00
 15   8.68    6.36   5.42   4.89   4.56   4.32   4.14   4.00   3.80   3.67   3.29   2.87
 16   8.53    6.23   5.29   4.77   4.44   4.20   4.03   3.89   3.69   3.55   3.18   2.75
 17   8.40    6.11   5.18   4.67   4.34   4.10   3.93   3.79   3.59   3.46   3.08   2.65
 18   8.29    6.01   5.09   4.58   4.25   4.01   3.84   3.71   3.51   3.37   3.00   2.57
 19   8.18    5.93   5.01   4.50   4.17   3.94   3.77   3.63   3.43   3.30   2.92   2.49
 20   8.10    5.85   4.94   4.43   4.10   3.87   3.70   3.56   3.37   3.23   2.86   2.42
 21   8.02    5.78   4.87   4.37   4.04   3.81   3.64   3.51   3.31   3.17   2.80   2.36
 22   7.95    5.72   4.82   4.31   3.99   3.76   3.59   3.45   3.26   3.12   2.75   2.31
 23   7.88    5.66   4.76   4.26   3.94   3.71   3.54   3.41   3.21   3.07   2.70   2.26
 24   7.82    5.61   4.72   4.22   3.90   3.67   3.50   3.36   3.17   3.03   2.66   2.21
 25   7.77    5.57   4.68   4.18   3.86   3.63   3.46   3.32   3.13   2.99   2.62   2.17
 26   7.72    5.53   4.64   4.14   3.82   3.59   3.42   3.29   3.09   2.96   2.58   2.13
 27   7.68    5.49   4.60   4.11   3.78   3.56   3.39   3.26   3.06   2.93   2.55   2.10
 28   7.64    5.45   4.57   4.07   3.75   3.53   3.36   3.23   3.03   2.90   2.52   2.06
 29   7.60    5.42   4.54   4.04   3.73   3.50   3.33   3.20   3.00   2.87   2.49   2.03
 30   7.56    5.39   4.51   4.02   3.70   3.47   3.30   3.17   2.98   2.84   2.47   2.01
 32   7.50    5.34   4.46   3.97   3.65   3.43   3.26   3.13   2.93   2.80   2.42   1.96
 34   7.45    5.29   4.42   3.93   3.61   3.39   3.22   3.09   2.90   2.76   2.38   1.91
 36   7.40    5.25   4.38   3.89   3.58   3.35   3.18   3.05   2.86   2.72   2.35   1.87
 38   7.35    5.21   4.34   3.86   3.54   3.32   3.15   3.02   2.83   2.69   2.32   1.84
 40   7.31    5.18   4.31   3.83   3.51   3.29   3.12   2.99   2.80   2.66   2.29   1.80
 60   7.08    4.98   4.13   3.65   3.34   3.12   2.95   2.82   2.63   2.50   2.12   1.60
120   6.85    4.79   3.95   3.48   3.17   2.96   2.79   2.66   2.47   2.34   1.95   1.38
  ∞   6.63    4.61   3.78   3.32   3.02   2.80   2.64   2.51   2.32   2.18   1.79   1.00




Table 9: Kolmogorov-Smirnov $D_n$ 


Entry: $D_{n,\alpha}$, where $Pr\{D_n \ge D_{n,\alpha}\} = \alpha$ 


n α=0.20 0.15 0.10 0.05 0.01 
1 0.900 0.925 0.950 0.975 0.995 
2 0.684 0.726 0.776 0.842 0.929 
3 0.565 0.597 0.642 0.708 0.828 
4 0.494 0.525 0.564 0.624 0.733 
5 0.446 0.474 0.510 0.565 0.669 
6 0.410 0.436 0.470 0.521 0.618 
7 0.381 0.405 0.438 0.486 0.577 
8 0.358 0.381 0.411 0.457 0.543 
9 0.339 0.360 0.388 0.432 0.514 
10 0.322 0.342 0.368 0.410 0.490 
11 0.307 0.326 0.352 0.391 0.468 
12 0.295 0.313 0.338 0.375 0.450 
13 0.284 0.302 0.325 0.361 0.433 
14 0.274 0.292 0.314 0.349 0.418 
15 0.266 0.283 0.304 0.338 0.404 
16 0.258 0.274 0.295 0.328 0.392 
17 0.250 0.266 0.286 0.318 0.381 
18 0.244 0.259 0.278 0.309 0.371 
19 0.237 0.252 0.272 0.301 0.363 
20 0.231 0.246 0.264 0.294 0.356 
25 0.210 0.220 0.240 0.270 0.320 
30 0.190 0.200 0.220 0.240 0.290 
35 0.180 0.190 0.210 0.230 0.270 


>35 1.07/√n 1.14/√n 1.22/√n 1.36/√n 1.63/√n 
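
For samples larger than 35, the last row of Table 9 gives the percentage point as a constant divided by $\sqrt{n}$. The Python sketch below applies that row, reading the constants as 1.07, 1.14, 1.22, 1.36 and 1.63. As an addition that is not taken from the text, it also evaluates the one-term asymptotic expression $\sqrt{-\frac{1}{2}\ln(\alpha/2)}$, which agrees with those constants to two decimals. All names in the code are illustrative.

```python
import math

# Constants c(alpha) read from the large-sample row of Table 9.
KS_CONSTANTS = {0.20: 1.07, 0.15: 1.14, 0.10: 1.22, 0.05: 1.36, 0.01: 1.63}


def ks_point_large_n(n: int, alpha: float) -> float:
    """Approximate D_{n,alpha} as c(alpha) / sqrt(n), intended for n > 35."""
    return KS_CONSTANTS[alpha] / math.sqrt(n)


def ks_constant_asymptotic(alpha: float) -> float:
    """One-term asymptotic constant sqrt(-0.5 * ln(alpha / 2)).

    Assumption: shown only as a cross-check, not a formula from the text.
    """
    return math.sqrt(-0.5 * math.log(alpha / 2.0))


if __name__ == "__main__":
    for alpha, c in KS_CONSTANTS.items():
        print(alpha, c, round(ks_constant_asymptotic(alpha), 2),
              round(ks_point_large_n(100, alpha), 4))
```

An observed $D_n$ larger than the computed point leads to rejection of the hypothesized distribution at level $\alpha$.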




Open Access. © 2018 Arak M. Mathai, Hans J. Haubold, published by De Gruyter. This work is licensed 
under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. 
https://doi.org/10.1515/9783110562545-018 


Index 


Analysis of variance (ANOVA) 493 
Analysis of variance principle 496 
ANOVA table 499 


Bayes' estimate 207 

Bayes' procedure 207 

Bayes' rule 55 

Bernoulli probability law 107 
Bernoulli trial 109 

Binomial series 77 

Buffon's clean tile problem 29 
Buffon's needle problem 30 


Cauchy-Schwarz' inequality 460 
Central confidence intervals 358 
Central limit theorem 253 
Characteristic function 101 
Chebyshev's inequality 250 
Chi-square 
- non-central 275 

Chi-square density 139 
Chi-square random variable 140 
Chi-squaredness 287 
Combinations 37 

Complete statistics 343 
Composite hypothesis 383 
Conditional density 176 
Conditional expectation 201 
Conditional probability 48 
Conditional variance 201 
Confidence interval 358 
Confidence region 378 
Consistency of estimators 311 
Correlation analysis 466 
Correlation coefficient 198 
Covariance 187 

Covariance matrix 197 
Cramer-Rao inequality 333 
Critical region 386 


Degrees of freedom 219 
Density 
- beta 3-parameter 163 
- beta non-central 163 
- beta type-1 143 
- beta type-2 144 
- bimodal 164 
- Bose-Einstein 164 
- Cauchy 164 
- chi 164 
- chi-square 139 
- non-central 164 
- circular normal 164 
- conditional 176 
- Dirichlet type-1 227 
- Dirichlet type-2 230 
- Erlang 164 
- exponential 139 
- extreme value 165 
- F 165 
- Fermi-Dirac 166 
- Fisher's 166 
- Galton's 166 
- gamma 136 
- gamma extended 136 
- gamma generalized 141 
- Gaussian (normal) 147 
- Gram-Charlier 166 
- Helmert's 166 
- inverse Gaussian 166 
- joint 259 
- Laplace 146 
- log-normal 166 
- logistic 162 
- marginal 176 
- Mathai's pathway 161 
- matrix variate gamma 243 
- matrix variate Gaussian or normal 242 
- Maxwell-Boltzmann 167 
- Pearson family 167 
- sech square 167 
- Student-t 277 
- uniform 133 
- rectangular 133 
- Weibull 142 
Density estimation 349 
Design of experiments 493 
Distribution function 65 


Efficiency of estimators 310 
Entropy 57 

Error 
- type-1 384 
- type-2 384 




Estimable function 308 
Estimation 303 
Estimators 306 

Event 
- sure 10 
Events 9 
- elementary 9 
- impossible 21 
- non-occurrence 11 
Events mutually exclusive 11 
Expected value 83 
Exponential density 139 
Exponential series 76 


Factorial designs 521 
Factorial moment 96 
Fisher's information 332 
Fixed effect models 495 
Fourier transform 103 

F random variable 277 


Gamma density 136 

- generalized 141 

Gamma function 134 

Gaussian or normal density 147 
Geometric probability law 111 
Goodness-of-fit test 420 
Greek-Latin square 517 


Hypergeometric probability law 123 
Hypotheses 

- alternate 384 

- composite 383 

- null 384 

- parametric 382 

- simple 383 

- statistical 382 


Iid variables 193 

Incomplete block designs 522 
Independence of events 51 
Independence of random variables 182 
Interaction 505 

Interval estimation 355 


Jacobian 211 
Joint distribution 171 
Joint moment generating function 191 


Kolmogorov-Smirnov test 428 
Kurtosis 159 


Lack-of-fit tests 420 
λ-criterion 391 

Laplace density 146 
Laplace transform 103 
Latin square designs 516 
Laws of large numbers 252 
Lepto-kurtic 160 
Likelihood function 259 
Likelihood ratio criterion 391 
Linear expression 284 
Linear forms 284 

Linear regression 450 
Logarithm 78 

Logarithmic series 76 
Logistic model 162 


Marginal density 176 

Markov's inequality 251 

Matrix derivatives 473 

Matrix-variate gamma density 243 

Matrix-variate gamma (real) 245 

Matrix-variate normal or Gaussian density 242 

Mean absolute deviation 91 

Mean value 86 

Median 88 

Method of least squares 477 

Method of maximum likelihood 323 

Method of minimum chi-square 329 

Method of minimum dispersion 330 

Method of moments 321 

Minimal sufficiency 319 

Minimum chi-square method 329 

Minimum dispersion method 330 

MLE (maximum likelihood estimator or estimate) 324 

Models 
- birth and death 440 
- branching process 439 
- general linear 489 
- non-deterministic 437 
- one-way classification 498 
- random walk 438 
- regression 440 
- time series 440 

Moment generating function 100 

Moment problem 104 

Moments 95 

Most powerful test 387 

Multinomial probability law 221 


Multiple correlation coefficient 459 

Multiple correlation ratio 466 

Multivariate distributions 221 

Multivariate hypergeometric probability law 224 

MVUE (minimum variance unbiased estimator) 311 


Negative binomial probability law 113 
Neyman factorization theorem 319 
Neyman-Pearson lemma 388 
Non-central chi-square 275 
Non-centrality parameter 275 
Non-deterministic models 437 
Normal density 147 

Normalizing constant 69 
Nuisance parameters 368 

Null hypothesis 383 
Null/non-null distribution 416 


One-way classification model 498 
Order statistics 291 
Orthogonal Latin squares 517 


Parameter 69 
Parameter space 305 
Parametric estimation 305 
Parametric hypotheses 382 
Partitioned matrices 461 
Pathway model 161 
Pearson’s chi-square test 420 
Permutation 33 
Pivotal quantities 356 
Plati-kurtic 160 
Pochhammer symbol 36 
Point estimation 303 
Poisson distribution 117 
Power curve 387 
Power of a test 386 
Prediction 204 
Probability 18 
- conditional 49 
- postulates 18 
Probability function 66 
Probability generating function 130 
Probability law 
- discrete 
  - Bernoulli 107 
  - beta-binomial 126 
  - beta-Pascal 126 
  - binomial 109 
  - Borel-Tanner 126 
  - discrete hypergeometric 123 
  - Dodge-Romig 127 
  - Engset 127 
  - geometric 111 
  - Hermite 127 
  - Hyper-Poisson 127 
  - hypergeometric 124 
  - logarithmic 127 
  - negative binomial 113 
  - Neyman 128 
  - Pascal 128 
  - Poisson 117 
  - Polya 128 
  - Short's 128 
Product probability property 51 
p-values 398 


Quadratic expression 285 
Quadratic forms 285 


Random cut 63 

Random effect model 522 
Random experiment 7 
Random variable 61 

- continuous 65 

- degenerate 69 

- discrete 65 

Random walk model 438 
Randomized block design 503 
Randomized experiments 494 
Rank test 431 

Rectangular density 133 
Regression 206 

- linear 450 

Regression type models 440 
Relative efficiency 310 
Response surface analysis 522 
Run test 433 


Sample mean 261 

Sample space 8 

Sample space partitioning 16 
Sample variance 261 
Sampling distribution 255 
Set 2 
- equality 3 
- intersection 4 
- subset 4 
- complement 5 
- union 4 
Sign test 430 
Simple hypothesis 383 
Standard deviation 88 
Statistic 261 
Statistical hypotheses 382 
Statistical independence 51 
Student-t statistic 277 
Sufficient statistic 314 


Tables of percentage points 559 
Test for no association 425 

Test for randomness 434 

Total probability law 53 


Transformation of variables 153, 211 
Two-way classification 505 

Type-1 beta density 143 
Type-1 error 384 

Type-2 beta density 143 

Type-2 error 384 


Unbiasedness 256 

Unbiasedness of estimators 307 
Uniform density 133 

Uniformly most powerful tests 387 


Variance 87 
Vector derivative 473 
Venn diagram 14 


